A Table-Driven, Full-Sensitivity Similarity Search Algorithm. Gene Myers and Richard Durbin. Presented by Wang, Jia-Nan and Huang, Yu-Feng

Upload: felicity-reeves

Post on 13-Dec-2015


Page 1: A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu-Feng

A Table-Driven, Full-Sensitivity Similarity Search Algorithm

Gene Myers and Richard Durbin

Presented by Wang, Jia-Nan and Huang, Yu-Feng


Outline

Introduction
Background
Preliminary
Method
Experiment


Introduction

Given a query and a database, find local alignments

Smith-Waterman: guaranteed to find all local alignments, but expensive

BLAST, FASTA: faster heuristics that may miss some alignments


Improvement

Hardware: invest in more computers and faster CPUs

Software: Phil Green’s SWAT appeals to sparsity and some machine-level coding tricks

  60% of the entries of the dynamic programming matrix have value 0

  Avoid computing most of these unproductive entries


Focus on improving protein similarity searches

This approach examines and computes only 4% of the underlying dynamic programming matrix


Recall

Sequence alignment
  Local sequence alignment
  Global sequence alignment

Goal – matching path with highest score

Table-based computation and dynamic programming


Dynamic Programming

Three basic components
  Recurrence relation
  Tabular computation
  Traceback


Smith-Waterman Method

Dynamic programming algorithm

Finds the most similar subsequences of two sequences

Problem
  The amount of computation is enormous
  Programmers will go crazy (and get excited)
  Why? How can it be accelerated?
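A minimal Python sketch of the method may help here. It is not the presenters' code: the function name `smith_waterman` and the default scores (+8, -5, constant gap penalty 20) are illustrative assumptions. It fills the whole table, which is exactly the cost the rest of the talk tries to avoid.

```python
def smith_waterman(query, target, match=8, mismatch=-5, gap=20):
    """Return (best score, aligned query, aligned target) for a local alignment.

    The scores are illustrative assumptions: +match, mismatch, and a
    constant penalty `gap` per gap symbol.
    """
    P, N = len(query), len(target)
    # Tabular computation: C[j][i] is the best score of a local alignment
    # ending at query position i and target position j (0 if none is positive).
    C = [[0] * (P + 1) for _ in range(N + 1)]
    best, best_pos = 0, (0, 0)
    for j in range(1, N + 1):
        for i in range(1, P + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Recurrence relation: restart (0), substitute, or open a gap.
            C[j][i] = max(0,
                          C[j - 1][i - 1] + sub,
                          C[j - 1][i] - gap,
                          C[j][i - 1] - gap)
            if C[j][i] > best:
                best, best_pos = C[j][i], (j, i)
    # Traceback: walk back from the best cell until a zero entry is reached.
    j, i = best_pos
    out_q, out_t = [], []
    while j > 0 and i > 0 and C[j][i] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if C[j][i] == C[j - 1][i - 1] + sub:
            out_q.append(query[i - 1])
            out_t.append(target[j - 1])
            j, i = j - 1, i - 1
        elif C[j][i] == C[j - 1][i] - gap:
            out_q.append("-")
            out_t.append(target[j - 1])
            j -= 1
        else:
            out_q.append(query[i - 1])
            out_t.append("-")
            i -= 1
    return best, "".join(reversed(out_q)), "".join(reversed(out_t))
```

The double loop touches all (P+1)(N+1) cells, so searching a database of N residues with a query of length P costs on the order of P·N steps.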


Background

Scoring System
  Simple scoring scheme
  Affine gap penalty scoring scheme
  PAM120 (PAMn)
  BLOSUM62 (BLOSUMn)


Simple Scoring Scheme

Match (e.g. +8)
Mismatch (e.g. -5)
Gap constant penalty (e.g. -20)
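As a small worked example, the sketch below scores an already-aligned pair of rows with the example values from this slide. The function name `score_alignment` and the use of '-' for a gap symbol are assumptions made for illustration.

```python
def score_alignment(row_a, row_b, match=8, mismatch=-5, gap=-20):
    """Score two equal-length alignment rows; '-' marks a gap symbol."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # constant penalty per gap symbol
        elif a == b:
            total += match
        else:
            total += mismatch
    return total
```

For instance, aligning ACGT with A-GT gives 8 - 20 + 8 + 8 = 4, while aligning ACGT with AGGT gives 8 - 5 + 8 + 8 = 19.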


Affine Gap Penalty Scoring Scheme

Match (e.g. +8)
Mismatch (e.g. -5)
Gap symbol (e.g. -5)
Gap open penalty (e.g. -10)


PAM

PAM – Percent Accepted Mutation
  Dayhoff et al. (1978)

PAM unit
  Evolutionary time corresponding to an average of 1 mutation per 100 residues (1% accepted)

PAMn
  Relates to mutation probabilities in an evolutionary interval of n PAM units

Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf


PAM120

Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html


BLOSUM62

BLOSUM – BLOcks SUbstitution Matrix
  Steven Henikoff and Jorja G. Henikoff (1992)
  Paper: Amino acid substitution matrices from protein blocks [PubMed]

BLOSUMn
  Relates to substitution probabilities observed between pairs of related proteins, with sequences above n% identity clustered together

Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf


BLOSUM62

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11


Preliminaries

Σ: the alphabet over which sequences are composed

|Σ| × |Σ| substitution matrix S giving the score of aligning each pair of letters

Uniform gap penalty g > 0

Query = q1 q2 . . . qP of P letters

Target = t1 t2 . . . tN of N letters

Threshold T > 0


Score Table and Edit Graph

Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif


Problem

Find a high-scoring local alignment between Query and Target, i.e. a path whose score is ≥ T

Edit graph (Figure 1)

Limit our attention to prefix-positive paths, i.e. paths in which every non-empty prefix has a positive score

  If there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
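The claim about prefix-positive paths can be checked with a short sketch. It assumes a path is represented simply as the list of its edge scores, and that "prefix-positive" means every non-empty prefix has a positive total. The helper names are hypothetical. Dropping the longest prefix whose total is ≤ 0 leaves a prefix-positive path whose score is at least as large.

```python
from itertools import accumulate


def is_prefix_positive(edge_scores):
    """True if every non-empty prefix of the path has a positive total."""
    return all(total > 0 for total in accumulate(edge_scores))


def trim_to_prefix_positive(edge_scores):
    """Drop the longest prefix whose total score is <= 0 and return the rest."""
    prefix_sums = [0] + list(accumulate(edge_scores))
    # Last position at which the running total is still non-positive.
    k = max(idx for idx, total in enumerate(prefix_sums) if total <= 0)
    return list(edge_scores[k:])
```

For the path with edge scores 8, -20, 8, 8, -5, 8 (total 7), the running totals are 8, -12, -4, 4, -1, 7. Trimming after the last non-positive total leaves the single edge 8, which is prefix-positive and scores 8 ≥ 7.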


Definition: a set P of index-value pairs { (i, v) : i ∈ [0, P]


The start and extension tables

Consider a vertex x in row j of the edit graph of Query vs. Target


Start Trimming

Limiting the dynamic programming to the startable vertices requires a table Start(w), where w ranges over the words of length k_s over Σ (|Σ|^k_s entries)
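If Start is indexed by words w of length k_s over Σ (an assumption about the slide's notation), one common way to address such a table is to read w as a base-|Σ| number. The sketch below is only illustrative: `word_index`, `AMINO_ACIDS`, and the choice k_s = 3 are hypothetical and not taken from the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter protein alphabet


def word_index(word, alphabet=AMINO_ACIDS):
    """Map a word over `alphabet` to an integer in [0, len(alphabet)**len(word))."""
    rank = {letter: r for r, letter in enumerate(alphabet)}
    index = 0
    for letter in word:
        # Treat each letter as one digit in base len(alphabet).
        index = index * len(alphabet) + rank[letter]
    return index


k_s = 3                                # hypothetical word length
table_size = len(AMINO_ACIDS) ** k_s   # one table slot per possible word
```

With 20 letters and k_s = 3 the table has 20^3 = 8000 slots, and the size grows by a factor of 20 for each extra letter, which is why the word length has to stay small.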


Start Trimming: worst case

Let α be the expected percentage of vertices that are seeds


Extension Trimming

A table that eliminates vertices that are not extendable

(i, j) is an extendable vertex iff C(i, j) > Extend(i, Target[j+1 … j+k_e])


Extension Trimming


A Table-Driven Scheme for DP

Goal: to restrict the SW computation to productive vertices

Jump table – captures the effect of Advance and Delete over $k_J > 0$ rows

$\mathrm{Jump}(i, w) = \{(k, u) : \text{the maximal path from } (i, 0) \text{ to } (k, k_J) \text{ in the edit graph of Query versus } w \text{ has score } u\}$

$O(|\Sigma|^{k_J} P^2)$ space: unmanageably large

But only record those $(k, u)$ for which $u > -(T - 1)$
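To make the Jump table concrete, here is a brute-force sketch that computes one entry Jump(i, w) by running the dynamic programming over the k_J rows spelled by w. It is an illustration under assumptions, not the authors' construction: it assumes a uniform gap penalty g, a caller-supplied `score(a, b)` substitution function, and the pruning rule u > -(T - 1); the function name `jump` is hypothetical.

```python
def jump(query, i, w, score, g, T):
    """Sketch of Jump(i, w): pairs (k, u) where u is the best score of a path
    from (i, 0) to (k, len(w)) in the edit graph of query versus w."""
    P = len(query)
    NEG = float("-inf")
    # Row 0: start at column i; each step to the right costs one gap.
    row = [NEG] * (P + 1)
    for k in range(i, P + 1):
        row[k] = -(k - i) * g
    for letter in w:
        new = [NEG] * (P + 1)
        for k in range(i, P + 1):
            best = row[k] - g                      # vertical edge (gap)
            if k > i:
                best = max(best,
                           row[k - 1] + score(query[k - 1], letter),  # diagonal
                           new[k - 1] - g)         # horizontal edge (gap)
            new[k] = best
        row = new
    # Keep only entries that can still contribute to a positive path.
    return {(k, u) for k, u in enumerate(row) if u > -(T - 1)}
```

With query "AB", w = "A", scores +8/-5 and g = 20, the unpruned entry for i = 0 is {(0, -20), (1, 8), (2, -12)}. With T = 15 the pair (0, -20) is dropped because -20 is not greater than -14. Since a candidate value v is below T, any u ≤ -(T - 1) would give v + u ≤ 0, so nothing is lost by pruning.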


Jump table

Start table

Space-saving version for Jump and Start tables


Check for paths scoring T or more

For each (i, v) in Cand_j, use v and Peak(i, Target[…]) to test for a path scoring T or more


Recall – Affine Gap Penalty

Score
  Match
  Mismatch
  Gap symbol: gsp
  Gap open penalty: gop

Affine cost of a gap of length k
  g + kh, where g = gop and h = gsp
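A one-line helper makes the cost formula concrete. The name `affine_gap_cost` is an assumption for illustration; the formula g + kh is the one on the slide.

```python
def affine_gap_cost(k, g, h):
    """Cost g + k*h of a gap of length k >= 1 (g = gop, h = gsp)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return g + k * h
```

With g = 10 and h = 5, a gap of length 1 costs 15 and a gap of length 3 costs 25, so extending an existing gap is cheaper than opening a new one.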


Diagram of Affine Gap Penalty

[Figure: affine edit graph with C, I, and D vertices; edges weighted -h, -g-h, and δ(a_i, b_j)]

Source: kmchao’s lecture note


Recurrence system - Gotoh

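$$
\begin{aligned}
D(i,j) &= \max \begin{cases} D(i-1,j) - h \\ C(i-1,j) - g - h \end{cases} \\
I(i,j) &= \max \begin{cases} I(i,j-1) - h \\ C(i,j-1) - g - h \end{cases} \\
C(i,j) &= \max \begin{cases} C(i-1,j-1) + \delta(a_i, b_j) \\ D(i,j) \\ I(i,j) \end{cases}
\end{aligned}
$$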
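The three-table recurrence for D, I, and C (Gotoh) can be sketched as follows. This is a minimal global-alignment version: the boundary conditions (a leading gap of length k costs g + kh, impossible states are minus infinity) and the name `gotoh_score` are assumptions, since the slide shows only the inner recurrence.

```python
def gotoh_score(a, b, delta, g, h):
    """Best global alignment score of a and b with affine gap cost g + k*h."""
    NEG = float("-inf")
    m, n = len(a), len(b)
    C = [[NEG] * (n + 1) for _ in range(m + 1)]
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    C[0][0] = 0
    # Assumed boundary: the first row/column consists of a single leading gap.
    for i in range(1, m + 1):
        D[i][0] = C[i][0] = -(g + i * h)
    for j in range(1, n + 1):
        I[0][j] = C[0][j] = -(g + j * h)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend a gap (-h) or open a new one (-g - h).
            D[i][j] = max(D[i - 1][j] - h, C[i - 1][j] - g - h)
            I[i][j] = max(I[i][j - 1] - h, C[i][j - 1] - g - h)
            C[i][j] = max(C[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                          D[i][j],
                          I[i][j])
    return C[m][n]
```

For example, with δ = +8/-5, g = 10 and h = 5, aligning AAA with A scores 8 - (10 + 2·5) = -12: one match plus a single gap of length 2.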


The Case of Affine Gap Costs

Simple scoring scheme → affine gap penalty scheme

Affine edit graph and vertex structure

Question: how should the equations defined above be modified?


Recurrence System for Affine Gap Costs

Two observations

  Computing the jth row from the (j-1)st requires knowing only the vectors of C and I values in row j-1, and not the D values in that row

  If I(i, j) < C(i, j) - g, then the I value at vertex (i, j) need not be recorded, as any maximal path through its I-vertex will have score less than the maximal path passing through the corresponding C-vertex


Recurrence System


Results


Experiment

Method
  Edit-graph-based approach vs. SWAT

Scoring matrix
  PAM120

Affine gap cost
  8 + 4n

Database (target)
  3-million-residue subset of the PIR database

Query
  A periodic clock protein of length 173 (pcp)
  A lactate dehydrogenase of length 319 (dehydro)
  A cGMP kinase of length 670 (kinase)
  A growth factor of length 1210 (g factor)


PAM120 & Gap Cost 8+4n


BLOSUM62 & Gap Cost 8+2n


Thanks for Your Attention
