a memory-efficient algorithm for multiple sequence alignment with constraints chin lung lu and yen...

39
A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic of China Bioinformatics, Vol. 21 no. 1 2005 Yutu Liu -- CPSC 689 Algorithmic Techniques for Biology Spring

Post on 21-Dec-2015

230 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

A memory-efficient algorithm for multiple sequence alignment with

constraints

Chin Lung Lu and Yen Pin Huang

National Chiao Tung UniversityTaiwan, Republic of China

Bioinformatics, Vol. 21 no. 1 2005

Yutu Liu -- CPSC 689 Algorithmic Techniques for Biology Spring 2005

Page 2: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

2

Motivation

Incorporate the biological structures and consensuses into sequence alignment

Memory efficient

Page 3: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

3

Problem Formulation -- Constraints

What is the multiple sequence alignment with constraints ?

A T C T C G C T

T G C A T A T

AT T

A T -- C -- T C G C T

-- T G C A T -- -- A T

-- -- -- A T C T C G C T

T G C A T A T -- -- -- --

1C 2C

Conserved sites of a protein

or DNA/RNA family

),...,,( 21 CCC

iiii i

cccC ...21

No overlapping between them

CCC ...21

Page 4: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

4

Problem Formulation -- Constraints

T G C A T A T

G A

iC

jS

2i

jS

iii ccC 21

1),( ij CS

Hamming Distance

0.5

1 i

Approximately appears

Page 5: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

5

Given S={s1,s2,…,sx}, and

Problem Formulation -- Constraints

A T G C A T C G C T

-- T G C A T -- -- A T

T T G C A T C A T C

L

Subseq(S2, L’)

Band L’

T G C C C

string T={t1,t2,..tk}, for

T approximately appears in L

*)),',(( kTLSsubseq i ,1 xi

Page 6: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

6

Problem Formulation

1 2 x 1 2Let S={S ,S ,...,S } over the alphabet . Let =(C ,C ,...,C )

be an ordered set of constraints. Then the CMSA of S w.r.t is an

alignment L of S over {-} with the optimal sum-of-pair score

(SP sco

'1 2 i

'

re) in which all the constraints of approximately appear in

the order of C C ... such that (subseq(S , ), )

for all 1 i x and 1 j , where is the band of L whose induced

consensus i

j j j

j

C L C

L

js C .

Constrained Multiple Sequence Alignment (CMSA)

S1

S2

S3

C1 C2 C3

Optimal Sum-of-Pair Score

CPSA

Page 7: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

7

CMSA

Pick two sequencesFind the CPSAUse it as a kernel to progressively

align more sequences

[1] Progressive Multiple Alignment with Constraints, Gene Myers et al. [2] MuSiC: A Tool for Multiple Sequence Alignment with Constraints Yin Te Tsai Chin Lung Lu Ching Ta Yu Yen Pin Huang

Page 8: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

8

Algorithm Overview

bj-1

ai-1

Find recursive relationship

ai

bj

M(i-1,j-1)

M(i,j)

Divide-and-Conquer

Page 9: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

9

Notations

1 2

1 2

1 2

1 2

...

...

( , ) ...

( , ) ...

m

n

i i

i i m i

A a a a

B bb b

pref A i a a a A

suff A i a a a A

1 2 1 2..... .....i i i ma a a a a a

iA iA

Page 10: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

10

Notations

1 2

1 2

1 2

1 2

( , j) ...

( , j) ...

( , k) ...

( , k) ...

j j

j j n j

k k

k k k

pref B b b b B

suff B b b b B

pref C C C

suff C C C

Page 11: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

11

Notation ( , )

k

DM i j

( , ) kM i j

( , ) K

IM i j

jBjB

iA iA1 2 1 1............ ..........i i m ma a a a a a

1 2 1 1........... .........j j n nbb b b b b

CkC1 Cγ … …

( , ) k

SM i j

( , ) k

DM i j

( , ) kM i j

( , ) K

IM i j

( , ) k

SM i j

( , ) kM i j

Page 12: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

12

Alignment Score

Let ( , ) be the score of an optimal constrained

alignment of and w.r.t k

i j k

M i j

A B

A

B ...C1 C2 Ck

Page 13: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

13

Alignment Score - Substitution

Let ( , ) be the maximum score of all constrained

alignments of and w.r.t that end with a substitution

pair ( , ).

Sk

i j k

i j

M i j

A B

a b

A

B ...C1 C2 Ck

ai

bj

Page 14: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

14

Alignment Score -- Deletion

i

i

Let ( , ) be the maximum scores of all constrained

alignment of A and w.r.t. that end with a deletion

pair (a , -).

Dk

j k

M i j

B

--

ai

A

B...

C1 C2 Ck

Page 15: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

15

Alignment Score -- Insertion

i

j

Let ( , ) be the maximum scores of all constrained

alignment of A and w.r.t. that end with a insertion

pair (-,b ).

Ik

j k

M i j

B

--

b j

A

B...

C1 C2 Ck

Page 16: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

16

Semi-Constrained Alignment

k-1

k

A semi-constrained alignment of and w.r.t

a constrained alignment and w.r.t , and end

with a band which is a prefix of C

i j k

i j

A B is

A B

A

B ...C1 C2 Ck-1 Ck

( , , )kN i j hh

( , )kpref C h

Page 17: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

17

Recurrence of Scores

k

if 0, then

( 1, 1) ( , )

M ( , ) max ( , )

( , )

k i jDkIk

k

M i j a b

i j M i j

M i j

Page 18: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

18

Recurrence of Scores

k

if 1 , then

( 1, 1) ( , )

( , )M ( , ) max

( , )

( , , )

k i jDkIk

k k

k

M i j a b

M i ji j

M i j

N i j

Page 19: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

19

Recurrence of Scores

1 0 1

If ( ( , ), ) and ( ( , ), )

( , , ) ( , ) ( , )

( , , )

k

i k k k j k k k

k k k k k i h j hh

k k

suff A C suff B C

N i j M i j a b

else

N i j

1 2 1............ ..........i h i ia a a a a

1 2 1........... .........j h j jbb b b b

Ck

Page 20: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

20

? ( 1, )kM i j

1 2 1...................... i ia a a a

1 2 1..................... j jbb b b

( , )DkM i j

D Sk k

D Dk k

D Ik k

Substitution: M ( , ) M ( 1, )

Deletion: M ( , ) M ( 1, )

Insertion: M ( , ) M ( 1, )

o e

e

o e

i j i j w w

i j i j w

i j i j w w

a i-1

b j

--

b j

a i-1

--

Page 21: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

21

Recurrence

Sk

D Ik k

Dk

M ( 1, )

M ( , ) max M ( 1, )

M ( 1, )

o e

o e

e

i j w w

i j i j w w

i j w

DkM ( 1, ) o ei j w w

kM ( 1, ) o ei j w w

kDk D

k

M ( 1, )M ( , ) max

M ( 1, )o e

e

i j w wi j

i j w

kIk I

k

M ( 1, )M ( , ) max

M ( 1, )o e

e

i j w wi j

i j w

Page 22: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

22

( i, j, k)( i, j-1, k)

( i-1, j-1, k) ( i-1, j, k)

Sequence B

Sequence A

Constraints

( m, n, γ )

( 0, 0, 0)

Page 23: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

23

1 0 1( , , ) ( , ) ( , )

kk k k k k i h j hh

N i j M i j a b

Nk

Page 24: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

24

Assignment

Design an algorithm to find the CPSA using dynamic programming

technique. Analyze the time and space complexity of your algorithm. For

simpilicity, you can ignore the open-gap penalty. Prove your algorithm

is consistent with the constrained set . . . it will find such a CPSA if

there exists one.

i e

Email: [email protected]

Page 25: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

25

Divide-and-Conquer

Page 26: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

26

( , , )kN i j h( , , )kN i j h

1 2 1( ) [ , ,..., , ( , )]k k kh c c c pref c h 1( ) [ ( , ), ,..., , ]k k k kh suff c h c c

( , )IkM i j( , )I

kM i j

( , )SkM i j ( , )S

kM i j

( , )kM i j ( , )kM i j

( , )DkM i j( , )D

kM i j

jBjB

iA iA1 2 1 1............ ..........i i m ma a a a a a

1 2 1 1........... .........j j n nbb b b b b

h

pref(Ck,h) suff(Ck, λk - h)

CkC1 Cγ … …

Page 27: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

27

Divide-and-Conquer

( , )mid mid midk i jL A B ( , )

mid mid midk i jL A B

Case 1: if the last pair of ( , ) is a substitution

A. ( , ) and ( , ) are optimal constrained

( , ) ( , ) ( , )

B.

mid mid mid

mid mid mid mid mid mid

k midmid

k i j

k i j k i j

Smid mid k mid mid

L A B

L A B L A B

M m n M i j M i j

L

( , ) and ( , ) are optimal semi-constrained

( , ) ( , , ) ( , , )

C. if

( , ) ( , , ) (

mid mid mid mid mid mid

mid mid

mid

mid mid mid

k i j k i j

k mid mid mid k mid mid mid

mid k

k mid mid k k m

A B L A B

M m n N i j h N i j h

h

M m n N i j M i

, ) id midj

Page 28: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

28

Divide-and-Conquer

( , )mid mid midk i jL A B ( , )

mid mid midk i jL A B

Case 2: if the last pair of ( , ) is a deletion

A. If the first pair of ( , ) is not a deletion pair

( , ) max{ ( , ) ( , ),

mid mid mid

mid mid mid

k midmid

k i j

k i j

D Smid mid k mid mid

L A B

L A B

M m n M i j M i j

( , ) ( , )}

B. If the first pair of ( , ) is a deletion pair

( , ) ( , ) ( , )

k midmid

mid mid mid

k midmid

D Imid mid k mid mid

k i j

D Dmid mid k mid mid o

M i j M i j

L A B

M m n M i j M i j w

Page 29: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

29

Summary( , )

midk mid midM i j

( , ) ( , )

( , ) ( , )

( , ) ( , )( , ) max

( , ) ( , )

( , , ) ( , , )

K midmid

K midmid

K midmid

K midmid

mid mid

mi

D Imid mid k mid mid

D Smid mid k mid mid

Smid mid k mid mid

D Dmid mid k mid mid o

k mid mid mid k mid mid mid

k

M i j M i j

M i j M i j

M i j M i jM m n

M i j M i j w

N i j h N i j h

N

( , , ) ( , )d mid midmid mid k k mid midi j M i j

( , ) ( , )K midmid

D Dmid mid k mid midM i j M i j

( , )midk mid midM i j( , )

Kmid

Dmid midM i j

Page 30: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

30

Take , , as indices , and , where 1 ,

0 and 1 , such that the following maximal value

is the maximum.

( , ) ( , )

( , ) ( , )

( , ) max ( , ) (

K

K

K

mid mid mid

k

Dmid k mid

Smid k mid

D Dmid k mi

j k h j k h j n

k h

M i j M i j

M i j M i j

M m n M i j M i

, )

( , , ) ( , , )

( , , ) ( , )

d o

k mid k mid

k mid k k mid

j w

N i j h N i j h

N i j M i j

Summary

( , ) { ( , , , )}midM m n Max F i j k h

2mid

mi

Page 31: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

31

, , ,Algorithm CPSA-DC( , , )

1. Divide A into 2, then call BestScore() and BestScoreRev(), where

the sizes of B and 's are not changed.

2. The BestScore() and BestScoreRev

start end start end start endi i j j k k

mid midi i

{ ( , , , )

() return all the alignment scores

of (A , , ) (A , , )

3 Find the where the value of j, k, h will be used as

the middle point to divide the alignment for recur

}

smi

k

d

j k jB an

max F i k h

B

j

d

ive call of

CPSA-DC()

Implementation -- CPSA-DC()

Page 32: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

32

( , ) kM i j

jB

jB

1 2 1 1............ ..........i i m ma a a a a a

1 2 1 1........... .........j j n nbb b b b b

CkC1 Cγ … …

midiA

Page 33: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

33

Complexity

k ,

A single matrix E of size ( +1)(n+1) with each entry of

4 for ( , ), ( ), ( , ), ( , )

and ( , , )

Temporary Space: V is the same size as E

Total Space: ( )

Let , the s

s D Ik mid k mid k mid k mid

k mid

M i j M i j M i j M i j

N i j h

n

mn

ize of the original problem,

then the total time complexity of CPSA-DC algorithm is

equal to ... 22 4 8

Page 34: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

34

Experimental Results

Page 35: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

35

Experimental Results

Page 36: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

36

Discussion

Lack of proof of consistency of constraints

Optimal pair-wise subsequences alignment might cause the failure of the overall optimal alignment

Page 37: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

37

Discussion

http://genome.life.nctu.edu.tw:8080/MUSICME/index.html

Page 38: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

38

Assignment

1 2 1 2

1 2

Let { , } over the alphabet . Let ( , ,..., )

be an ordered set of constraints, where ... . Then

the Constrained Pair-wise Sequence Alignment (CPSA) of S

w.r.t is an alignment

i

i i ii

S S S C C C

C c c c

1 2

'

L of S over {-} with the optimal

sum-of-pair score (SP score) in which all the constraints of

approximately appear in the order of ... such

that the hamming distance ( ( , ), )i j j

C C C

subseq S L C

'

for

all 1 2 and 1 , where 0 1, and is the band

of L whose induced consensus is . A band is a block of

consecutive columns in L.

Design an efficient algorithm to find the CPSA using

j

j

j

i j L

C

dynamic

programming technique or whatever method you prefer. For

simpilicity, you can ignore the open-gap penalty. Analyze

the time and space complexity of your algorithm. Prove your

algorithm is consistent with the constraint set . . . it will find

such a CPSA if there exists one.

i e

Page 39: A memory-efficient algorithm for multiple sequence alignment with constraints Chin Lung Lu and Yen Pin Huang National Chiao Tung University Taiwan, Republic

39

Reference

Efficient Constrained Multiple Sequence Alignmentwith Performance GuaranteeFrancis Y.L. Chin N.L. Ho T.W. Lamy Prudence W.H. Wong M.Y. Chan

Divide-and-conquer multiple alignment withsegment-based constraintsMichael Sammeth1,∗, Burkhard Morgenstern2 and Jens Stoye 1

Multiple sequence alignment with the divide-and-conquer methodJens Stoye

MuSiC: A Tool for Multiple Sequence Alignment with Constraints Yin Te Tsai1 Chin Lung Lu2 ∗ Ching Ta Yu1 Yen Pin Huang