bioinformatics bio314madhusudhan/bio314_2017/bio_314_lecture_1… · r. durbin, s. r. eddy, a....

Bioinformatics Bio314

Sequence alignments

- dynamic programming

- local; global; different gap penalties

Substitution matrices

- Construction; different types

Multiple sequence alignments

Phylogeny and clustering

Markov models

- Hidden Markov model algorithms

Motif finding

Heuristic Alignments

Structural biology/bioinformatics

- Secondary structure prediction

- Neural networks

- 3D structure modeling

- Structural analysis

- Molecular dynamics simulations

- drug design

Next generation sequencing

Bio314 Evaluation/assessment

Mid-sem exam 30%

End-sem exam 30%

Quizzes 15%

Assignments 25%


Reference material

R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1999), ISBN 0521629713

Arthur Lesk, Introduction to Bioinformatics (Oxford University Press), ISBN-13: 978-0199208043

David Mount, Bioinformatics: Sequence and Genome Analysis (CSHL press, 2004), ISBN-13: 978-0879697129

Other books (names will be mentioned later in class).

Several research papers + web sites + web servers.


What is Bioinformatics?

Representation

Scoring

Sampling (Optimization)


The central dogma

source: wikipedia


Evolution at the DNA level

AGCCTAGACAGTTAAG---AGACGGTTA

Events labelled in the figure: inversion, translocation, duplication, insertion, mutation

Sequence Alignments

Important in

- Deducing evolutionary relationships

- Function annotation

- Identifying important regions


Aligning pairs of sequences

Given two strings

X = X1,X2,X3….XM

Y = Y1,Y2,Y3….YN

An alignment is an assignment of gaps to positions 0,….M in X and 0,….N in Y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.


Why sequence alignments?

Sequence similarity often implies a structural and functional relationship

New sequences emerge by

- insertion/deletion

- substitutions

Similarity between 2 sequences is assessed by an alignment


References

R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1999), ISBN 0521629713

Arthur Lesk, Introduction to Bioinformatics (Oxford University Press), ISBN-13: 978-0199208043

Papers/articles: will be distributed at appropriate times





Alignment examples

HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
F11G11.2    GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Each column is a match, a mismatch, or a gap (insertion/deletion)

The alignment scheme

Representation

Scoring

- substitution scores

- gap penalties

sampling/optimization

- Enumerate all possible alignments?
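The scoring step of the scheme can be sketched in a few lines: given an alignment that is already written out as two gapped rows, add up one substitution score per aligned pair and one penalty per gap position. The match, mismatch, and gap values below are hypothetical placeholders chosen only for illustration; a real protein alignment would use a substitution matrix instead.

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Score a finished alignment column by column.

    The default scores are illustrative assumptions, not values from the course.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        if a == "-" or b == "-":
            total += gap        # gap position (insertion/deletion)
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total

print(score_alignment("GAATC-", "CA-TAC"))
```

Different gap placements for the same two strings generally give different totals, which is why the optimization step has to search over alignments.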



Alignment by enumeration

Too many to compute/enumerate!

GAATC     GAATC-
CATAC     CA-TAC

GAAT-C    GAAT-C
C-ATAC    CA-TAC

-GAAT-C   GA-ATC
C-A-TAC   CATA-C
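To get a feel for how quickly enumeration blows up, the number of global alignments can be counted with a small recursion. This is a sketch under one stated assumption: every column is either a residue pair, a residue over a gap, or a gap over a residue, and alignments that differ only in the order of adjacent gap columns are counted as distinct. The function name is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of global alignments of sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only one way: align the remaining residues to gaps
    # The last column is (x, -), (-, y) or (x, y).
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))

for length in (5, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two 5-letter strings such as GAATC and CATAC this count is already 1683, and it grows exponentially with sequence length.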

Substitution scores

S(a,b) = Score for substituting residue ‘a’ by residue ‘b’

S(a,b) for proteins is a 20 X 20 matrix

How do we get the values of this matrix?

- Popular matrices = PAM250, BLOSUM62, BLOSUM50...

$$S(a,b) = \log\left(\frac{p_{ab}}{q_a\, q_b}\right)$$
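A minimal sketch of the log-odds score, assuming p_ab is the probability of seeing residues a and b aligned in related sequences and q_a, q_b are their background frequencies. The logarithm base (2 here) and the probabilities in the example are assumptions for illustration; published matrices such as BLOSUM and PAM also scale and round the values to integers.

```python
import math

def log_odds_score(p_ab, q_a, q_b):
    """Log-odds substitution score: log2 of observed over expected-by-chance."""
    return math.log2(p_ab / (q_a * q_b))

# Hypothetical numbers: a pair seen twice as often as chance predicts.
print(log_odds_score(0.02, 0.1, 0.1))
# A pair seen half as often as chance predicts gets a negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```

Pairs that co-occur more often than chance get positive scores, and pairs that co-occur less often get negative ones.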

Gap penalties

Gaps are penalized

Linear gap penalty: all gaps are penalized with equal penalty d

Affine gap penalty:

- Distinguish between opening and extension of gaps

- penalty for gap opening larger than for gap extension
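The two penalty schemes can be compared with a short sketch. It assumes one common affine convention, in which the opening penalty pays for the first gap position and the extension penalty for each further position; the function names and the numeric values are hypothetical.

```python
def linear_gap(length, d):
    """Score contribution of a gap under a linear penalty d per position."""
    return -d * length

def affine_gap(length, gap_open, gap_extend):
    """Score contribution of a gap under an affine penalty.

    Assumed convention: gap_open for the first position,
    gap_extend for every additional position.
    """
    if length == 0:
        return 0
    return -(gap_open + gap_extend * (length - 1))

for length in (1, 2, 5):
    print(length, linear_gap(length, d=8), affine_gap(length, gap_open=12, gap_extend=2))
```

With these illustrative values a single long gap is cheaper under the affine scheme than several short ones, which is the behaviour the opening/extension distinction is meant to produce.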



Dynamic Programming

Optimization problem: Find optimal alignment by maximizing score

Optimal alignments are obtained from optimal sub-alignments

The alignment of

X1,X2......Xi
Y1,Y2......Yj

can be constructed from the 3 sub-alignments:

X1,X2......Xi-1
Y1,Y2......Yj-1

X1,X2......Xi-1
Y1,Y2......Yj

X1,X2......Xi
Y1,Y2......Yj-1

Global and local alignments

2 flavours of dynamic programming

- Global: Needleman-Wunsch [1970], Peter Sellers [1974]

LGPSTKDFGKISESREFDN
LNQLERSFGKINMRLEDA

- Local: Smith-Waterman [1981]

-------FGKI--------
-------FGKI-------

Overhangs are not penalized.

Global dynamic programming

Align the 2 sequences X and Y

X: HEAGAWGHEE

Y: PAWHEAE

A compact representation

BLOSUM50 substitution matrix (residues of X along the columns, residues of Y along the rows)

      H    E    A    G    A    W    G    H    E    E
P    -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A    -2   -1    5    0    5   -3    0   -2   -1   -1
W    -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H    10    0   -2   -2   -2   -3   -2   10    0    0
E     0    6   -1   -3   -1   -3   -3    0    6    6
A    -2   -1    5    0    5   -3    0   -2   -1   -1
E     0    6   -1   -3   -1   -3   -3    0    6    6

Linear gap penalty: d = 8, i.e. each gap position scores -8

XXXXXXXXXXXXX--
YYYYYYYYYYYYYYY

XXXX--XXXXXXXXX-
YYYYYYYYYY--YYYY

All possible alignments can be enumerated in this matrix representation

The alignment scheme

Representation

Scoring

- substitution scores

- gap penalties

sampling/optimization

- Compute scoring matrix

- Recurrence formula

- Traceback

Global dynamic programming: recurrence relationship

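Each cell of the matrix holds the score of the best alignment of two prefixes:

F(i-1,j-1): X1,X2......Xi-1 with Y1,Y2......Yj-1

F(i,j-1): X1,X2......Xi with Y1,Y2......Yj-1

F(i-1,j): X1,X2......Xi-1 with Y1,Y2......Yj

F(i,j): X1,X2......Xi with Y1,Y2......Yj

Moving into F(i,j) adds S(Xi,Yj) from F(i-1,j-1), and subtracts d from F(i,j-1) or from F(i-1,j).

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + S(X_i,Y_j) \\ F(i,j-1) - d \\ F(i-1,j) - d \end{cases}$$

Base conditions:

F(i,0) = -i * d

F(0,j) = -j * d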
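The recurrence and base conditions translate almost directly into code. Below is a minimal sketch that fills the F matrix for the lecture's example sequences, assuming the BLOSUM50 scores listed in these slides and the linear penalty d = 8. The dictionary only covers the residue pairs that occur in this example, and the function and variable names are hypothetical.

```python
# BLOSUM50 scores for the residue pairs in the example (symmetric, stored once).
BLOSUM50_SUBSET = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2,
    ("A", "P"): -1, ("A", "W"): -3, ("E", "E"): 6, ("E", "G"): -3,
    ("E", "H"): 0, ("E", "P"): -1, ("E", "W"): -3, ("G", "H"): -2,
    ("G", "P"): -2, ("G", "W"): -3, ("H", "H"): 10, ("H", "P"): -2,
    ("H", "W"): -3, ("P", "W"): -4, ("W", "W"): 15,
}

def s(a, b):
    """Look up the symmetric substitution score S(a, b)."""
    return BLOSUM50_SUBSET[tuple(sorted((a, b)))]

def global_score_matrix(x, y, d=8):
    """Fill F(i, j) for sequences x and y under a linear gap penalty d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # base condition F(i, 0) = -i * d
        F[i][0] = -i * d
    for j in range(1, n + 1):          # base condition F(0, j) = -j * d
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align Xi with Yj
                F[i][j - 1] - d,                          # Yj against a gap
                F[i - 1][j] - d,                          # Xi against a gap
            )
    return F

F = global_score_matrix("HEAGAWGHEE", "PAWHEAE")
print(F[10][7])  # score of the best global alignment
```

For this example the bottom-right cell F(10,7) comes out as 1, which is the global alignment score.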

Global dynamic programming: Scoring Matrix computation

Columns are indexed i = 0..10 along X, rows j = 0..7 along Y.

           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

The cell for W in X and P in Y is max(-40 - 4, -33 - 8, -48 - 8) = -41; it is sometimes printed as -42.

$$F(i,j) = \max \begin{cases} F(i,j-1) + s(-,y_j) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) + s(x_i,-) \end{cases}$$

$$F(1,1) = \max \begin{cases} F(1,0) + s(-,P) = -8 + (-8) = -16 \\ F(0,0) + s(H,P) = 0 + (-2) = -2 \\ F(0,1) + s(H,-) = -8 + (-8) = -16 \end{cases} = -2$$

$$F(2,1) = \max \begin{cases} F(2,0) + s(-,P) = -16 + (-8) = -24 \\ F(1,0) + s(E,P) = -8 + (-1) = -9 \\ F(1,1) + s(E,-) = -2 + (-8) = -10 \end{cases} = -9$$

Global dynamic programming: Traceback

Start from the last cell, F(10,7) = 1, of the scoring matrix for X = HEAGAWGHEE and Y = PAWHEAE, and at each step move back to the cell that produced the current value.

Scores along the traceback path, from F(10,7) back to F(0,0):

1, -5, 3, -3, -13, -5, -20, -25, -17, -16, -8, 0

Resulting alignment:

HEAGAWGHE-E
--P-AW-HEAE
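The traceback can be sketched as follows: fill the matrix, then walk back from F(m,n) to F(0,0), at each cell checking which of the three moves reproduces its value. This is a minimal sketch assuming the BLOSUM50 scores from these slides and d = 8. Ties are broken in favour of the diagonal move, which is one possible choice; other tie-breaking rules can give other equally scoring alignments. All names are hypothetical.

```python
BLOSUM50_SUBSET = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2,
    ("A", "P"): -1, ("A", "W"): -3, ("E", "E"): 6, ("E", "G"): -3,
    ("E", "H"): 0, ("E", "P"): -1, ("E", "W"): -3, ("G", "H"): -2,
    ("G", "P"): -2, ("G", "W"): -3, ("H", "H"): 10, ("H", "P"): -2,
    ("H", "W"): -3, ("P", "W"): -4, ("W", "W"): 15,
}

def s(a, b):
    """Symmetric substitution score lookup."""
    return BLOSUM50_SUBSET[tuple(sorted((a, b)))]

def needleman_wunsch(x, y, d=8):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Traceback: rebuild the alignment from the last cell to the first.
    out_x, out_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            out_x.append(x[i - 1]); out_y.append(y[j - 1])   # Xi aligned with Yj
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            out_x.append(x[i - 1]); out_y.append("-")        # Xi against a gap
            i -= 1
        else:
            out_x.append("-"); out_y.append(y[j - 1])        # Yj against a gap
            j -= 1
    return F[m][n], "".join(reversed(out_x)), "".join(reversed(out_y))

score, row_x, row_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(row_x)
print(row_y)
```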

Dynamic programming: computational complexity

- F-matrix computation:
  - We store (n+1) x (m+1) numbers
  - Each number involves a constant number of computations: three sums and a maximum

- Traceback:
  - O(m+n) time

- Complexity of the algorithm:
  - O(mn) time
  - O(mn) memory

For sequences of comparable lengths, the dynamic programming algorithm is O(n^2), which is computationally tractable


Local alignments

Global Alignment shortcomings

Effective when the two sequences have detectable sequence similarity over their entire lengths

It could be inaccurate when we are only interested in finding the best alignment between subsequences

Local Alignments

Effective when the two sequences share detectable similarity over only part of their lengths

Problem: Given two sequences X and Y, find subsequences α and β of X and Y, respectively, whose alignment score is maximum over all pairs of subsequences from X and Y

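Local dynamic programming: recurrence relationship

The global recurrence is kept, with one extra option: a cell is reset to 0 whenever every other option is negative, so an alignment can start anywhere.

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + S(X_i,Y_j) \\ F(i,j-1) - d \\ F(i-1,j) - d \end{cases}$$

Base conditions:

F(i,0) = 0

F(0,j) = 0

Smith, T. F. and Waterman, M. S. 1981. Journal of Molecular Biology 147:195-197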

Local dynamic programming

Initialization: the first row and the first column are all 0.

Filled matrix (BLOSUM50, d = 8):

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26

Local dynamic programming: Traceback (Smith-Waterman algorithm)

Start from the highest value in the matrix, 28, and trace back until a cell with value 0 is reached.

Scores along the traceback path: 28, 22, 12, 20, 5, 0

Resulting local alignment:

AWGHE
AW-HE
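A minimal sketch of local alignment for the same pair of sequences, assuming the BLOSUM50 scores from these slides and d = 8. Compared with the global version, cells are floored at 0, the borders are 0, the traceback starts at the highest-scoring cell, and it stops at the first cell with value 0. The names and the diagonal-first tie-breaking are assumptions for illustration.

```python
BLOSUM50_SUBSET = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2,
    ("A", "P"): -1, ("A", "W"): -3, ("E", "E"): 6, ("E", "G"): -3,
    ("E", "H"): 0, ("E", "P"): -1, ("E", "W"): -3, ("G", "H"): -2,
    ("G", "P"): -2, ("G", "W"): -3, ("H", "H"): 10, ("H", "P"): -2,
    ("H", "W"): -3, ("P", "W"): -4, ("W", "W"): 15,
}

def s(a, b):
    """Symmetric substitution score lookup."""
    return BLOSUM50_SUBSET[tuple(sorted((a, b)))]

def smith_waterman(x, y, d=8):
    """Return (score, aligned_x, aligned_y) for the best local alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # borders stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest value until a 0 is reached.
    out_x, out_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            out_x.append(x[i - 1]); out_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            out_x.append(x[i - 1]); out_y.append("-")
            i -= 1
        else:
            out_x.append("-"); out_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(out_x)), "".join(reversed(out_y))

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

The unaligned ends of both sequences are simply left out of the result, which is what "overhangs are not penalized" means in practice.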

Global vs Local dynamic programming

                          Global                                      Local
Recurrence relationship   Similar in both cases                       Similar in both cases
Traceback                 Start with the last value                   Start with the highest value
Overhangs                 Penalized                                   Not penalized
Output                    Optimal alignment between whole sequences   Optimal alignment between sub-sequences
Sensitivity               Detecting sequences with high similarity    Better at detecting remote similarities
Computational complexity  O(n*n) in time and memory                   O(n*n) in time and memory



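Affine gap penalties

Gap penalty: G = U + k*(L-1)

- U = gap open penalty, charged for the first gap position; k = gap extension penalty, charged for each further position; L = gap length

Base conditions: penalty accumulated along the first row and the first column (the stored score is the negative of the penalty shown)

            H     E     A     G     A     W     G     H     E     E
      0     U    U+k   U+2k  U+3k  U+4k  U+5k  U+6k  U+7k  U+8k  U+9k
P     U
A     U+k
W     U+2k
H     U+3k
E     U+4k
A     U+5k
E     U+6k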

Affine gap penalties: recurrence relationships

The alignment of

X1,X2......Xi
Y1,Y2......Yj

is again built from the 3 sub-alignments:

X1,X2......Xi-1
Y1,Y2......Yj-1

X1,X2......Xi
Y1,Y2......Yj-1

X1,X2......Xi-1
Y1,Y2......Yj

Xi aligned to a gap, gap opening:

X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj   -

G(i,j) = V(i-1,j) - U

Xi aligned to a gap, gap extension:

X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . Yj -    -

G(i,j) = G(i-1,j) - k

Xi aligned to Yj:

X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj-1 Yj

F(i,j) = V(i-1,j-1) + S(Xi,Yj)

Yj aligned to a gap, gap opening:

X1 X2 . . . . . Xi   -
Y1 Y2 . . . . . Yj-1 Yj

H(i,j) = V(i,j-1) - U

Yj aligned to a gap, gap extension:

X1 X2 . . . . . -    -
Y1 Y2 . . . . . Yj-1 Yj

H(i,j) = H(i,j-1) - k

Xi aligned to Yj:

X1 X2 . . . . . Xi-1 Xi
Y1 Y2 . . . . . Yj-1 Yj

F(i,j) = V(i-1,j-1) + S(Xi,Yj)


$$V(i,j) = \max\{F(i,j),\ G(i,j),\ H(i,j)\}$$

$$F(i,j) = V(i-1,j-1) + S(X_i,Y_j)$$

$$G(i,j) = \max \begin{cases} V(i-1,j) - U \\ G(i-1,j) - k \end{cases}$$

$$H(i,j) = \max \begin{cases} V(i,j-1) - U \\ H(i,j-1) - k \end{cases}$$
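The three-matrix recurrences can be sketched as below. This assumes that opening a gap costs U for its first position and k for each further position, and that the borders hold the score of a single gap running from the start. The match/mismatch scoring function and the penalty values in the example are hypothetical, and only the score is returned, without a traceback.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, score, U, k):
    """Best global alignment score with gap open penalty U and extension k."""
    m, n = len(x), len(y)
    # V: best score; G: ends with Xi against a gap; H: ends with Yj against a gap.
    V = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    G = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    V[0][0] = 0
    for i in range(1, m + 1):              # leading gap of length i
        G[i][0] = -(U + (i - 1) * k)
        V[i][0] = G[i][0]
    for j in range(1, n + 1):              # leading gap of length j
        H[0][j] = -(U + (j - 1) * k)
        V[0][j] = H[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F = V[i - 1][j - 1] + score(x[i - 1], y[j - 1])
            G[i][j] = max(V[i - 1][j] - U, G[i - 1][j] - k)   # open or extend
            H[i][j] = max(V[i][j - 1] - U, H[i][j - 1] - k)   # open or extend
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[m][n]

def simple_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(affine_global_score("ACGT", "AT", simple_score, U=3, k=1))
```

In the example the best alignment matches A and T and spans C and G with one gap of length 2, so the score is 2 + 2 - (3 + 1) = 0.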
