Bioinformatics Bio314
Sequence alignments
- dynamic programming
- local; global; different gap penalties
Substitution matrices
- Construction; different types
Multiple sequence alignments
Phylogeny and clustering
Markov models
- Hidden Markov model algorithms
Motif finding
Heuristic Alignments
Structural biology/bioinformatics
- Secondary structure prediction
- Neural networks
- 3D structure modeling
- Structural analysis
- Molecular dynamics simulations
- drug design
Next generation sequencing
Bio314 Evaluation/assessment
Mid-sem exam 30%
End-sem exam 30%
Quizzes 15%
Assignments 25%
Reference material
R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1999), ISBN 0521629713
Arthur Lesk, Introduction to Bioinformatics (Oxford University Press), ISBN-13: 978-0199208043
David Mount, Bioinformatics: Sequence and Genome Analysis (CSHL press, 2004), ISBN-13: 978-0879697129
Other books (names will be mentioned later in class).
Several research papers + web sites + web servers.
What is Bioinformatics?
Representation
Scoring
Sampling (Optimization)
The central dogma
source: wikipedia
Evolution at the DNA level
AGCCTAGACAGTTAAG---AGACGGTTA
inversion, translocation, duplication, insertion, mutation
Sequence Alignments
Important in
- Deducing evolutionary relationships
- Function annotation
- Identifying important regions
Aligning pairs of sequences
Given two strings
X = X1,X2,X3….XM
Y = Y1,Y2,Y3….YN
An alignment is an assignment of gaps to positions 0,….M in X and 0,….N in Y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
Why sequence alignments?
Sequence similarity often implies a structural and functional relationship
New sequences emerge by
- insertion/deletion
- substitutions
Similarity between 2 sequences is assessed by an alignment
Alignment examples
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Matches, mismatches, gaps (insertions/deletions)
The alignment scheme
Representation
Scoring
- substitution scores
- gap penalties
sampling/optimization
- Enumerate all possible alignments?
- dynamic programming
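The scoring step of the scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the match, mismatch and gap values are assumptions chosen for the example, not values from the lecture. It treats an alignment as two gapped strings of equal length and adds up one score per column.

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of a gapped pairwise alignment.

    match, mismatch and gap are hypothetical illustration values
    (a toy substitution score and a linear gap penalty).
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold two gaps")
        if a == "-" or b == "-":
            total += gap          # letter aligned to a gap
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # substitution
    return total


print(score_alignment("GAATC", "CATAC"))    # no gaps
print(score_alignment("GAAT-C", "CA-TAC"))  # two single-letter gaps
```

With real data the toy match/mismatch values would be replaced by a substitution matrix lookup.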
Alignment by enumeration
Too many to compute/enumerate!
GAATC      GAATC-
CATAC      CA-TAC

GAAT-C     GAAT-C
C-ATAC     CA-TAC

-GAAT-C    GA-ATC
C-A-TAC    CATA-C
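To get a feel for why enumeration is hopeless, the number of possible alignments can be counted with a small recurrence. The sketch below assumes one particular counting convention (every distinct sequence of columns of the form letter/letter, letter/gap or gap/letter counts as a separate alignment); other conventions give different but similarly fast-growing numbers. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count gapped alignments of sequences of length m and n.

    Convention (an assumption of this sketch): each alignment is a
    sequence of columns, and the last column is either letter/letter,
    letter/gap or gap/letter.
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]  # only gaps: one way
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column aligns two letters
                + counts[i - 1][j]    # last column: letter of X over a gap
                + counts[i][j - 1]    # last column: gap over a letter of Y
            )
    return counts[m][n]


print(count_alignments(5, 5))  # two 5-letter sequences such as GAATC and CATAC: 1683
for length in (10, 20, 50):
    print(length, count_alignments(length, length))
```

Even for the two 5-letter sequences above there are already 1683 candidates under this convention, and the count grows exponentially with sequence length.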
Substitution scores
S(a,b) = Score for substituting residue ‘a’ by residue ‘b’
S(a,b) for proteins is a 20 X 20 matrix
How do we get the values of this matrix?
- Popular matrices = PAM250, BLOSUM62, BLOSUM50...
$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$
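A small numerical sketch of the log-odds score may help. It assumes the usual reading of the symbols: p_ab is the probability of seeing residues a and b aligned in related sequences, and q_a, q_b are their background frequencies. The probabilities used below are made-up illustration values, not entries of PAM or BLOSUM.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b)), natural logarithm."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical numbers: the pair is seen twice as often as chance predicts.
print(log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1))   # positive score
# Hypothetical numbers: the pair is seen half as often as chance predicts.
print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1))  # negative score
```

Pairs that occur more often than expected by chance get positive scores, and pairs that occur less often get negative scores. Published matrices additionally scale and round these values.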
Gap penalties
Gaps are penalized
Linear gap penalty: every gap position is penalized by the same penalty d, so a gap of length L costs L*d
Affine gap penalty:
- Distinguish between opening and extension of gaps
- penalty for gap opening larger than for gap extension
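The two penalty schemes can be compared with a short sketch. The value d = 8 matches the linear penalty used in the worked example of this lecture; the affine values (opening 12, extension 2) are hypothetical. The sketch also assumes that the first position of a gap pays the opening penalty and each further position pays the extension penalty.

```python
def linear_gap_penalty(length, d=8):
    """Every gap position costs d."""
    return length * d


def affine_gap_penalty(length, gap_open=12, gap_extend=2):
    """First gap position costs gap_open, each further one gap_extend.

    gap_open and gap_extend are hypothetical illustration values.
    """
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


for length in (1, 2, 4, 8):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With these numbers a single long gap is cheaper under the affine scheme than under the linear one, while many short gaps are more expensive.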
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
F11G11.2  GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
Dynamic Programming
Optimization problem: Find the optimal alignment by maximizing the score
Optimal alignments are obtained from optimal sub-alignments
The alignment of
X1,X2......Xi
Y1,Y2......Yj
can be constructed from the 3 sub-alignments:
X1,X2......Xi-1
Y1,Y2......Yj-1
X1,X2......Xi-1
Y1,Y2......Yj
X1,X2......Xi
Y1,Y2......Yj-1
Global and local alignments
2 flavours of dynamic programming
- Global Needleman-Wunsch [1970], Peter Sellers [1974]
LGPSTKDFGKISESREFDN
LNQLERSFGKINMRLEDA
- Local Smith-Waterman [1981]
-------FGKI--------
-------FGKI-------
Overhangs are not penalized.
Global dynamic programming
Align the 2 sequences X and Y
X: HEAGAWGHEE
Y: PAWHEAE
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6
BLOSUM50 substitution matrix
Linear gap penalty = -8 (d = 8)
A compact representation
[Figure: matrix with X X X X . . . along one axis and Y Y Y . . . along the other]

XXXXXXXXXXXXX--
YYYYYYYYYYYYYYY

XXXX--XXXXXXXXX-
YYYYYYYYYY--YYYY
All possible alignments can be enumerated in this matrix representation
The alignment scheme
Representation
Scoring
- substitution scores
- gap penalties
sampling/optimization
- dynamic programming
- Compute scoring matrix
- Recurrence formula
- Traceback
Global dynamic programming: recurrence relationship
[Figure: the alignment of X1...Xi with Y1...Yj, scored F(i,j), extends one of three sub-alignments: from F(i-1,j-1) by adding S(Xi,Yj), from F(i,j-1) by subtracting d, or from F(i-1,j) by subtracting d.]
F(i,j) = Maximum of
F(i-1,j-1) + S(Xi,Yj)
F(i,j-1) -d
F(i-1,j) -d
Base conditions:
F(i,0) = -i * d
F(0,j) = -j * d
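The recurrence and base conditions translate almost line by line into code. The following is a minimal sketch, not a reference implementation: the names `s`, `BLOSUM50_ROWS` and `global_score_matrix` are hypothetical, and only the BLOSUM50 entries needed for the lecture's example sequences (X = HEAGAWGHEE, Y = PAWHEAE) are included. F[i][j] uses i for positions in X and j for positions in Y, as in the formula above.

```python
# BLOSUM50 entries for the example only: keyed by a residue of Y,
# listing scores against the residues H, E, A, G, W that occur in X.
X_RESIDUES = "HEAGW"
BLOSUM50_ROWS = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    """Substitution score for residue a (from X) against b (from Y)."""
    return BLOSUM50_ROWS[b][X_RESIDUES.index(a)]


def global_score_matrix(x, y, d=8):
    """Fill F for global alignment with linear gap penalty d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d              # base condition F(i,0)
    for j in range(1, n + 1):
        F[0][j] = -j * d              # base condition F(0,j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
    return F


F = global_score_matrix("HEAGAWGHEE", "PAWHEAE")
print(F[1][1], F[2][1])  # first two computed cells: -2 -9
print(F[10][7])          # score of the optimal global alignment: 1
```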
Global dynamic programming: Scoring Matrix computation
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Columns are indexed i = 0..10 (sequence X), rows j = 0..7 (sequence Y).

F(i,j) = max of
F(i,j-1) + s(-,yj)
F(i-1,j-1) + s(xi,yj)
F(i-1,j) + s(xi,-)

F(1,1) = max of
F(1,0) + s(-,P) = -8 + (-8) = -16
F(0,0) + s(H,P) = 0 + (-2) = -2
F(0,1) + s(H,-) = -8 + (-8) = -16
so F(1,1) = -2

F(2,1) = max of
F(2,0) + s(-,P) = -16 + (-8) = -24
F(1,0) + s(E,P) = -8 + (-1) = -9
F(1,1) + s(E,-) = -2 + (-8) = -10
so F(2,1) = -9
Global dynamic programming: Traceback
[Figure: the F matrix with the traceback path highlighted]
Cell values along the path, from F(10,7) back to F(0,0): 1, -5, 3, -3, -13, -5, -20, -25, -17, -16, -8, 0
Resulting alignment:
H E A G A W G H E - E
- - P - A W - H E A E
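The traceback can be sketched as follows. This is one possible implementation under stated assumptions: the names `s`, `BLOSUM50_ROWS` and `needleman_wunsch` are hypothetical, only the BLOSUM50 entries needed for the example are included, and when several predecessor cells give the same score the diagonal move is tried first. Other tie-breaking rules can return a different alignment with the same optimal score.

```python
X_RESIDUES = "HEAGW"
BLOSUM50_ROWS = {  # BLOSUM50 entries for the example sequences only
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    """Substitution score for residue a (from X) against b (from Y)."""
    return BLOSUM50_ROWS[b][X_RESIDUES.index(a)]


def needleman_wunsch(x, y, d=8):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Traceback: start at the last cell and walk back to F(0,0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])   # letter over letter
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")        # gap in Y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])        # gap in X
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)   # 1
print(top)     # HEAGAWGHE-E
print(bottom)  # --P-AW-HEAE
```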
Dynamic programming: computational complexity
• F-matrix computation:
  – We store (n+1) x (m+1) numbers
  – Each number involves a constant number of computations
    • Three sums and a maximum
• Traceback:
  – O(m+n) time
• Complexity of the algorithm:
  – O(mn) time
  – O(mn) memory
For sequences of comparable lengths, the dynamic programming algorithm is O(n^2)
  – Computationally tractable
Local alignments
Global Alignment shortcomings
Effective when the two sequences have detectable sequence similarity over their entire lengths
It could be inaccurate when we are only interested in finding the best alignment between subsequences
Local Alignments
Effective when the two sequences have detectable sequence similarity over only part of their lengths
Problem: Given two sequences X and Y, find subsequences α and β of X and Y, respectively, whose alignment score is maximum over all pairs of subsequences from X and Y
Local dynamic programming: recurrence relationship
F(i,j) = Maximum of
0
F(i-1,j-1) + S(Xi,Yj)
F(i,j-1) - d
F(i-1,j) - d
Base conditions:
F(i,0) = 0
F(0,j) = 0
Smith, T. F. and Waterman, M. S. 1981. Journal of Molecular Biology 147:195-197
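A minimal sketch of the local (Smith-Waterman) matrix fill follows. Compared with the global case it has a zero option in the maximum and zeros in the first row and column. The names `s`, `BLOSUM50_ROWS` and `local_score_matrix` are hypothetical, and only the BLOSUM50 entries needed for the example sequences HEAGAWGHEE and PAWHEAE are included.

```python
X_RESIDUES = "HEAGW"
BLOSUM50_ROWS = {  # BLOSUM50 entries for the example sequences only
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    """Substitution score for residue a (from X) against b (from Y)."""
    return BLOSUM50_ROWS[b][X_RESIDUES.index(a)]


def local_score_matrix(x, y, d=8):
    """Fill F for local alignment with linear gap penalty d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                        # start a new local alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
    return F


F_local = local_score_matrix("HEAGAWGHEE", "PAWHEAE")
best = max(max(row) for row in F_local)
print(best)  # highest cell value, the local alignment score: 28
```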
Local dynamic programming
Initialization:
     H   E   A   G   A   W   G   H   E   E
  0  0   0   0   0   0   0   0   0   0   0
P 0
A 0
W 0
H 0
E 0
A 0
E 0

Completed matrix:
     H   E   A   G   A   W   G   H   E   E
  0  0   0   0   0   0   0   0   0   0   0
P 0  0   0   0   0   0   0   0   0   0   0
A 0  0   0   5   0   5   0   0   0   0   0
W 0  0   0   0   2   0  20  12   4   0   0
H 0 10   2   0   0   0  12  18  22  14   6
E 0  2  16   8   0   0   4  10  18  28  20
A 0  0   8  21  13   5   0   4  10  20  27
E 0  0   6  13  18  12   4   0   4  16  26
Smith-Waterman algorithm
Local dynamic programming: Traceback
[Figure: the local F matrix with the traceback path highlighted]
Cell values along the path, from the highest value back to the first 0: 28, 22, 12, 20, 5, 0
Resulting alignment:
A W G H E
A W - H E
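The local traceback differs from the global one in two places: it starts at the highest value in the matrix and stops at the first 0. The sketch below is an illustration under assumptions: the names `s`, `BLOSUM50_ROWS` and `smith_waterman` are hypothetical, only the BLOSUM50 entries for the example are included, and the diagonal move is tried first when predecessors tie.

```python
X_RESIDUES = "HEAGW"
BLOSUM50_ROWS = {  # BLOSUM50 entries for the example sequences only
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    """Substitution score for residue a (from X) against b (from Y)."""
    return BLOSUM50_ROWS[b][X_RESIDUES.index(a)]


def smith_waterman(x, y, d=8):
    """Return (score, aligned_x, aligned_y) for the best local alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Traceback: start at the highest value, stop at the first 0.
    best, i, j = max((F[i][j], i, j)
                     for i in range(m + 1) for j in range(n + 1))
    ax, ay = [], []
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])   # letter over letter
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")        # gap in Y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])        # gap in X
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


local_score, local_top, local_bottom = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(local_score)   # 28
print(local_top)     # AWGHE
print(local_bottom)  # AW-HE
```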
Global vs Local dynamic programming
                          Global                                      Local
Recurrence relationship   Similar in both cases                       Similar in both cases
Traceback                 Start with the last value                   Start with the highest value
Overhangs                 Penalized                                   Not penalized
Output                    Optimal alignment between whole sequences   Optimal alignment between sub-sequences
Sensitivity               Detecting sequences with high similarity    Better at detecting remote similarities
Computational complexity  O(n*n) in time and memory                   O(n*n) in time and memory
Affine gap penalties
Gap penalty: G = U + k*L
- U = gap open penalty; k = gap extension penalty; L = number of extensions (gap length minus 1), so that a gap of length 1 costs U
Base conditions -
       H    E    A    G    A    W    G    H    E    E
  0    U    U+k  U+2k U+3k U+4k U+5k U+6k U+7k U+8k U+9k
P U
A U+k
W U+2k
H U+3k
E U+4k
A U+5k
E U+6k
Affine gap penalties: recurrence relationships
Match (Xi aligned to Yj):
X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj-1 Yj
F(i,j) = V(i-1,j-1) + S(Xi,Yj)

Gap opening (Xi aligned to a gap):
X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj   -
G(i,j) = V(i-1,j) - U

Gap extension (Xi-1 and Xi both aligned to gaps):
X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . Yj -   -
G(i,j) = G(i-1,j) - k*1
Affine gap penalties: recurrence relationships
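Match (Xi aligned to Yj):
X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj-1 Yj
F(i,j) = V(i-1,j-1) + S(Xi,Yj)

Gap opening (Yj aligned to a gap):
X1 X2 . . . . . Xi   -
Y1 Y2 . . . . . Yj-1 Yj
H(i,j) = V(i,j-1) - U

Gap extension (Yj-1 and Yj both aligned to gaps):
X1 X2 . . . . Xi -    -
Y1 Y2 . . . . . Yj-1 Yj
H(i,j) = H(i,j-1) - k*1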
Affine gap penalties: recurrence relationships
V(i,j) = max{F(i,j), G(i,j), H(i,j)}
F(i,j) = V(i-1,j-1) + S(Xi,Yj)
G(i,j) = max{V(i-1,j) - U, G(i-1,j) - k*1}
H(i,j) = max{V(i,j-1) - U, H(i,j-1) - k*1}
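The three recurrences can be sketched as a score-only implementation with the matrices V, G and H (F is just a local value per cell). Everything numeric here is an assumption for illustration: a toy match/mismatch score stands in for the substitution matrix, U = 3 and k = 1 are hypothetical, and the boundary cells charge U for the first gap position and k for each further one, in line with the base conditions listed earlier. The function name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Optimal global alignment score with affine gap penalties.

    gap_open plays the role of U and gap_extend the role of k;
    all default values are hypothetical.
    """
    m, n = len(x), len(y)
    V = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # best score per cell
    G = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # Xi aligned to a gap
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # Yj aligned to a gap
    V[0][0] = 0
    for i in range(1, m + 1):
        G[i][0] = V[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, n + 1):
        H[0][j] = V[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F = V[i - 1][j - 1] + sub
            G[i][j] = max(V[i - 1][j] - gap_open,     # open a gap
                          G[i - 1][j] - gap_extend)   # extend a gap
            H[i][j] = max(V[i][j - 1] - gap_open,
                          H[i][j - 1] - gap_extend)
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[m][n]


# ACGT against AT: two matches (+4) and one gap of length 2 (-(3 + 1)).
print(affine_global_score("ACGT", "AT"))  # 0
```

Keeping G and H separate is what lets the algorithm tell whether a gap is being opened or extended while staying at O(mn) time.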