multiple sequence composition alignment

Post on 18-Jan-2016

DESCRIPTION

Multiple Sequence Composition Alignment. Name: Yip Chi Kin Date: 21-12-2006. Studied Papers. [B03] Composition Alignment. [S98] Divide-and-conquer Alignment. [M99] DIALIGN Algorithm. [SMS03] DCA + Segment-based. Main Aspects. ․Dynamic Programming - PowerPoint PPT Presentation

TRANSCRIPT

Multiple Sequence

Composition Alignment

Name: Yip Chi Kin

Date: 21-12-2006

Studied Papers

[SMS03] DCA + Segment-based

[B03] Composition Alignment

[M99] DIALIGN Algorithm

[S98] Divide-and-conquer Alignment

Main Aspects

․Dynamic Programming

․Composition Alignment

․Meta-code MSA

․Simultaneous MSA

Pairwise Library (Global & Local)

Consistency & Ungapped

Divide-and-conquer

Segment-based (Optimal scores)

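Dynamic Programming

DP Matrix and Dot Matrix (figure: the sequence C T G A compared against C T G A)

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} - d \\ S_{i,j-1} - d \end{cases}$$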

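Edit Graph

(Figure: edit graph for C T G against C T A. Diagonal edges are matches, entering from $S_{i-1,j-1}$ with weight $s(a_i, b_j)$; horizontal and vertical edges are insertions and deletions, entering from $S_{i-1,j}$ and $S_{i,j-1}$ with weight $-d$.)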

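Global Alignment

Needleman-Wunsch Algorithm

        T    C    T    T    C    -
   C   -2    0   -3   -4   -7  -10
   T   -4   -3   -1   -2   -5   -8
   A   -5   -5   -3   -2   -3   -6
   C   -6   -4   -4   -2   -1   -4
   G   -9   -7   -5   -3   -1   -2
   -  -10   -8   -6   -4   -2    0

(The matrix is drawn with its origin, S = 0, in the bottom-right corner, so the sequences G C A T C and C T T C T are read from bottom to top and from right to left. The optimal score, -2, is in the top-left corner.)

GA Results

G C A T C -

- C T T C T

Scoring

$$s(a_i, -) = s(-, b_j) = -2$$

$$s(a_i, b_j) = 1 \text{ if } a_i = b_j, \qquad s(a_i, b_j) = -1 \text{ if } a_i \neq b_j$$

$$S_{i,0} = -2i, \qquad S_{0,j} = -2j$$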
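The global alignment slide can be illustrated with a minimal Python sketch of the Needleman-Wunsch recurrence. It assumes the scoring that the slide's matrix is consistent with (match +1, mismatch -1, gap -2). The function name `needleman_wunsch` and its parameters are illustrative and do not come from the slides, and the traceback returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = gap * i          # boundary: a[:i] against gaps only
    for j in range(1, m + 1):
        S[0][j] = gap * j          # boundary: b[:j] against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # match / mismatch
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    # Traceback from the bottom-right cell to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GCATC", "CTTCT")
print(score)   # optimal global score for the slide's two sequences
print(top)
print(bottom)
```

The sketch fills the matrix from the top-left corner, whereas the slide draws it from the bottom-right. The optimal score is the same either way.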

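Local Alignment

Smith-Waterman Algorithm

       G  A  C  G  G  A  C  A  T  T  T  -
   T   1  3  5  4  3  0  0  0  2  2  2  0
   G   3  2  4  6  5  1  0  0  0  0  0  0
   G   3  1  2  3  4  3  2  0  0  0  0  0
   C   2  1  2  0  1  2  4  0  0  0  0  0
   A   1  3  0  0  1  2  1  2  0  0  0  0
   A   0  2  1  1  0  2  0  2  0  0  0  0
   G   2  0  0  2  2  0  0  0  0  0  0  0
   -   0  0  0  0  0  0  0  0  0  0  0  0

(The matrix is drawn with its origin in the bottom-right corner, so the sequences G A A C G G T and T T T A C A G G C A G are read from bottom to top and from right to left. The highest entry, 6, marks the best local alignment.)

GA Results

- G A A C - G G T - -

T T T A C A G G C A G

Scoring

$$s(a_i, -) = s(-, b_j) = -2$$

$$s(a_i, b_j) = 2 \text{ if } a_i = b_j, \qquad s(a_i, b_j) = -1 \text{ if } a_i \neq b_j$$

$$S_{i,0} = 0, \qquad S_{0,j} = 0$$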
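The local alignment slide can be sketched the same way. This minimal Python version of Smith-Waterman assumes the scoring that the slide's matrix is consistent with (match +2, mismatch -1, gap -2, all scores floored at 0). The name `smith_waterman` and its parameters are illustrative, and the traceback reports only one highest-scoring local alignment.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # S[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                        # start a new alignment
                          S[i - 1][j - 1] + sub,    # match / mismatch
                          S[i - 1][j] + gap,        # gap in b
                          S[i][j - 1] + gap)        # gap in a
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    # Traceback from the best cell until a zero entry is reached.
    out_a, out_b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = smith_waterman("GAACGGT", "TTTACAGGCAG")
print(best)    # highest local score for the slide's two sequences
print(top)
print(bottom)
```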

MSA Methods

․Consistency-based

․Exact method

․Progressive method

․Iterative method

․Stochastic method

․Hidden Markov method

MSA Concepts

PSAs

C - G T C T
C T G T C C

C - - T G T C C
C G A T A T - T

C G - - - T C T
C G A T A T - T

Trace formulation

(Figure: trace graph drawn over the letters of the sequences; the individual letters and edges are not recoverable from the transcript.)

Latter formulation

(Figure: the same sequences drawn letter by letter; the layout is not recoverable from the transcript.)

Consistency-based method

MSA Results

Aligned regions

(Figure: aligned regions marked on the sequences, letter by letter; the layout is not recoverable from the transcript.)

Results of MSA

G T CCT--C

G T TC---C

A T T-TAGC

G T TC---C

Unrealized Consistent

Realized

Divide-and-conquer

(Figure: each of the sequences S1, S2 and S3 is divided at a cut position C1, C2, C3 into a prefix and a suffix. The steps shown are Divide, repeated on both halves, then Align optimally, then Concatenate.)

DP Distance

Wopt (suffix)

        C    T    A    T    A    C    -
   G    3    4    3    4    6    8   10
   T    4    2    3    2    4    6    8
   A    6    4    2    2    2    4    6
   T    8    6    4    2    1    2    4
   C   10    8    6    4    2    0    2
   -   12   10    8    6    4    2    0

Wopt (prefix)

        -    C    T    A    T    A    C
   -    0    2    4    6    8   10   12
   G    2    1    3    5    7    9   11
   T    4    3    1    3    5    7    9
   A    6    5    3    1    3    5    7
   T    8    7    5    3    1    3    5
   C   10    8    7    5    3    2    3

$$C_{S_1,S_2}[C_1, C_2] = W_{opt}(\text{prefix}) + W_{opt}(\text{suffix}) - W_{opt}(\text{total})$$

Sequence: GTTCATGCCAGGTGTAAATC

Additional-cost

CS1,S2[2,2] = Wopt [CT,GT] + Wopt [ATAC,ATC] – Wopt [CTATAC,GTATC] = 1 + 2 – 3 = 0

CS1,S2[4,3] = Wopt [CTAT,GTA] + Wopt [AC,TC] – Wopt [CTATAC,GTATC] = 3 + 1 – 3 = 1

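Additional-cost matrix (cut positions 0 to 6 in S1 = C T A T A C across, cut positions 0 to 5 in S2 = G T A T C down)

        0    1    2    3    4    5    6
   0    0    3    4    7   11   15   19
   1    3    0    3    4    8   12   16
   2    7    4    0    2    4    8   12
   3   11    8    4    0    1    4    8
   4   15   12    8    4    0    0    4
   5   19   15   12    8    4    1    0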

CS1,S2[1,1] = 0

CS1,S2[2,2] = 0

CS1,S2[3,3] = 0

CS1,S2[4,4] = 0

CS1,S2[5,4] = 0

CS1,S2[6,5] = 0
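The additional-cost values above can be recomputed with a short Python sketch. It assumes the distance costs that reproduce the slide's DP Distance values (match 0, mismatch 1, gap 2) and takes S1 = CTATAC and S2 = GTATC. The helper names `distance_matrix` and `additional_cost` are illustrative, not from the slides.

```python
def distance_matrix(a, b, mismatch=1, gap=2):
    """D[i][j] = minimal alignment cost of a[:i] against b[:j]."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = gap * i
    for j in range(1, m + 1):
        D[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D


def additional_cost(s1, s2):
    """C[c1][c2] = Wopt(prefixes) + Wopt(suffixes) - Wopt(total)."""
    n, m = len(s1), len(s2)
    prefix = distance_matrix(s1, s2)
    # Suffix costs come from the same DP run on the reversed sequences.
    suffix_rev = distance_matrix(s1[::-1], s2[::-1])
    total = prefix[n][m]
    return [[prefix[c1][c2] + suffix_rev[n - c1][m - c2] - total
             for c2 in range(m + 1)]
            for c1 in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])          # the two worked examples on the slide
zero_cuts = [(c1, c2) for c1 in range(7) for c2 in range(6) if C[c1][c2] == 0]
print(zero_cuts)                 # cut pairs that lie on an optimal alignment
```

Cut pairs with additional cost 0 are those through which some optimal pairwise alignment passes, which is why they line up along the diagonal of the matrix.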

Cost of Diagonal

Space & Time

‘Chain’ of boxes along the diagonal, in order to reduce searching time

Full sequence searching

DIALIGN

y I A - V L F - A E d

- L A c V I F - G s -

p w d d V T F d A E -

GA Results

(Figure: three panels, "Consistent diagonals", "Non-Consistent (Simultaneous)" and "Non-Consistent (Cross over)", each showing diagonals drawn between the three sequences below.)

Y I A V L F A E D

L A C V I F G S

P W D D V T F D A E

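Weighting

Diagonal Weights

$$w(D) = -\log P(l_D, S_D)$$

where $l_D$ is the length of diagonal D and $S_D$ is the sum of the similarity values along D. $P(l_D, S_D)$ is the probability that a random diagonal of length $l_D$ reaches a similarity sum of at least $S_D$.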

Overlap weighting

Y I A V L F A Y D D

L A C V I F G S

S W D D V M F Y A E

Diagonals D1, D4 and D5

Score = 1.9 + 2.6 + 0.2 = 4.7

Diagonals D1, D2, D3 and D5

Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3

w(D1) = 1.9

w(D2) = 1.7

w(D3) = 1.5

w(D4) = 2.6

w(D5) = 0.2
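The two scores can be compared in a small Python sketch. The weights are the ones listed above. The conflict relation is an assumption made only for illustration: D4 is taken to be inconsistent with D2 and with D3, which is one reading of why the slide offers the two alternative sets. Real consistency in DIALIGN is checked against the whole set of accepted diagonals, not pair by pair, so this is a simplification.

```python
from itertools import combinations

weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# ASSUMPTION (not stated on the slide): D4 cannot be combined with D2 or D3.
conflicts = {frozenset(("D4", "D2")), frozenset(("D4", "D3"))}


def is_consistent(diagonals):
    """True if no pair of the chosen diagonals is in the conflict set."""
    return not any(frozenset(pair) in conflicts
                   for pair in combinations(diagonals, 2))


def total(diagonals):
    """Sum of weights, rounded to one decimal like the slide's scores."""
    return round(sum(weights[d] for d in diagonals), 1)


def greedy_selection():
    """Accept diagonals in order of decreasing weight if still consistent."""
    chosen = []
    for d in sorted(weights, key=weights.get, reverse=True):
        if is_consistent(chosen + [d]):
            chosen.append(d)
    return sorted(chosen)


def best_selection():
    """Exhaustive search over all consistent subsets (fine for 5 diagonals)."""
    best = ()
    names = sorted(weights)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if is_consistent(subset) and total(subset) > total(best):
                best = subset
    return list(best)


print(greedy_selection(), total(greedy_selection()))
print(best_selection(), total(best_selection()))
```

Under the assumed conflicts, picking the heaviest diagonal first yields the set scoring 4.7, while the exhaustive search finds the set scoring 5.3.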

Consistency check

(Figure: sequences S1 and S2 with positions 1 to 18 and S3 with positions 1 to 16, drawn in two panels labelled "Overlap weights" and "Fragments checking", with fragments f1, f2 and f3.)

Transitivity frontier [1,9]

Greedy Strategy

Greedy Approach

(Figure: panels "Tandem duplications" and "Consistency conflicts", showing the motifs M1 (1), M1 (2), M2 and M3 placed on the sequences S1, S2 and S3.)

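Composition Alignment

Single character match

Composition matches

A A C G T C T T T G A G C T C

A G C C T G A C T - G C C T A

Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0

Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1

Differences (Sequence #1 minus Sequence #2, nonzero positions only): + + – + – – – + – – –

Running difference by prefix length: 0 0 0 1 2 2 1 2 1 0 -1 0 -1 -2 -3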
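The running difference for the two binary sequences can be reproduced with a few lines of Python. This is a sketch under the usual definition of a composition match (two substrings of equal length with the same character counts). The function names `prefix_difference` and `composition_match` are illustrative, not from the slides.

```python
from collections import Counter


def prefix_difference(x, y, symbol="1"):
    """Running difference in the count of `symbol` between prefixes of x and y."""
    diff, out = 0, []
    for cx, cy in zip(x, y):
        diff += (cx == symbol) - (cy == symbol)
        out.append(diff)
    return out


def composition_match(u, v):
    """True if u and v have the same length and the same character counts."""
    return len(u) == len(v) and Counter(u) == Counter(v)


seq1 = "010111010001000"
seq2 = "010001101110111"

diff = prefix_difference(seq1, seq2)
print(diff)

# Two prefix lengths with the same running difference delimit substrings
# with equal composition, e.g. prefix lengths 4 and 7 (both have value 1).
print(seq1[4:7], seq2[4:7], composition_match(seq1[4:7], seq2[4:7]))
```

Whenever the running difference takes the same value at two prefix lengths, the substrings between those positions have equal composition, so composition matches can be read off as horizontal levels of this curve.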

CM of Prefix Length

(Figure: composition matching of Sequence #1 and Sequence #2 plotted against prefix length; the composition-matching axis runs from –3 to 2, and match lengths are marked along the prefix-length axis.)

111010001001101110

1110100 010011011 10

Replaced by 7

Replaced by 2

Composition Matching

CM of Prefix Length (Total=9)

Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0

Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1

(Figure: the two sequences repeated, with the matching regions marked at the levels CM = 2, CM = 1, CM = 0 and CM = -1.)

Meta-Code

Code about code

Meta-Code

Original Code

Mismatch code

Match code

Code for Testing

Input code

Control Rule

Code Reservoir

Mismatch code

Reservoir Codes

Code ‘CT’

Store code in Reservoir S2

Code from S1

Code from S2

Store code in Reservoir S1

If both codes are found in Reservoir S1 and Reservoir S2, delete these two codes

Reservoir Code (e.g. AGRCT)

Code ‘G’

Code ‘G’

Code ‘C’

Code ‘C’

Code ‘AG’

Code ‘A’ in S1

Code ‘T’ in S2

Meta-Code Rule

Meta-code (e.g. AMT)

Codes from S1 and S2

Copy the codes from S1 and S2, p = p – 1, output meta-code.

If CM length is valid, reservoir code = r, position = p.

Values of r and p

Value of r

If reservoir code = r, then stop the looping

Looping for creating meta-code

CM (Lengths & Codes)

A A A A AG

A G GA

GA

AG

AG

T T T T TC

C C CT

CT

TT

CC

Reservoir codes in S1

S1: T A C T C G G A C

S2: T T C G C C A T C

MetaCode

Length: 0 1 1 1 1 2 1 2 2

Reservoir code: R ART ART ART ART AGRCT GRC GARTC GARTC

GT 2 AGRTT

TC 2 AGRCC

CG 1 ARC

Reservoir codes in S2

Composition Matching of S1 and S2 in prefix length

CM of Metacode

(Figure: composition matching plotted against prefix length, with the reservoir codes ART, AGRCT, AGRTT, ARC and GARTC marked along the curve and two regions labelled "Invalid length".)

Composition MSA

T A C G T C G T C G A C

T T C T G C C C G A T C

T T C TMG GMT C C CMT GMC AMG TMA C

T T C G T C C T C G A C

Composition matching

Meta-code MSA

New

S1

S2

S2

| | | |

T T C T G C C C G A T C

T A C G T C G T C G A C

Fixed Segment

Code catalogue

A = Currency / Cards

B = Stock / Structured P.

C = Unit Trusts / Bonds

D = Insurance / Finance

E = Mortgages / Loans

Time Granularities

                    Week #1              Week #2              …
                    1t1 1t2 1t3 1t4 1t5  2t1 2t2 2t3 2t4 2t5  3t1 3t2
Branch bank #1 …    A   C   B   C   E    B   A   A   E   B    C   A
Branch bank #2 …    B   E   B   A   A    A   C   E   D   B    E   A
Branch bank #3 …    A   C   A   A   B    C   E   E   D   B    E   E

․Weekly behaviour

Segment Length LS = 5

․Semi-global alignment

․Least overlap problem

․Simple segmentation

․Composition alignment
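The simple segmentation step can be sketched in Python using the three branch-bank sequences and the segment length LS = 5 from the slide. The helper names `fixed_segments` and `segment_compositions` are illustrative. How segments are then scored against each other is not spelled out in the transcript, so the sketch stops at listing each segment's composition.

```python
from collections import Counter

# Code sequences from the slide, one character per time point.
banks = {
    "Branch bank #1": "ACBCEBAAEBCA",
    "Branch bank #2": "BEBAAACEDBEA",
    "Branch bank #3": "ACAABCEEDBEE",
}


def fixed_segments(seq, length=5):
    """Cut seq into consecutive segments of `length` (the last may be shorter)."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def segment_compositions(seq, length=5):
    """Character counts of each fixed-length segment."""
    return [Counter(segment) for segment in fixed_segments(seq, length)]


for name, seq in banks.items():
    print(name, fixed_segments(seq))
    for week, counts in enumerate(segment_compositions(seq), start=1):
        print("  week", week, dict(sorted(counts.items())))
```

Comparing the per-week counts rather than the exact order of codes is what makes the comparison a composition alignment: two weeks with the same counts match even if the products were recorded in a different order.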

Family Classifications

Fixed-Segment Composition MSA

Branch bank #1

Branch bank #2

Branch bank #3

C C B C A D

A A B A D C

A B A A B B

Composition alignment

Family

Group

Meta-Code Branch bank #1

Branch bank #2

Meta-Code Branch bank #3

C C B A D C

A A B A D C

A A B A B B

PSA

Family

Group

Further Problems

․Fixed-segment length

․Prior sequence choice

․Speed-up PSAs

․Nos. of Segments/Codes

Meta-Code Composition MSA

Conclusions

․Fixed-segment Composition (Least Overlap Problems)

․Meta-code Approach (Easier Transform Applications)

․Widespread use of MSA (Simultaneous Multiple Sequences)
