BNFO 602 Lecture 2 Usman Roshan

Post on 13-Jan-2016

Page 1: BNFO 602 Lecture 2

BNFO 602 Lecture 2

Usman Roshan

Page 2: BNFO 602 Lecture 2

DNA Sequence Evolution

[Figure: evolutionary tree of a DNA sequence, built up over three animation steps. Sequences at each time level:]

-3 mil yrs: AAGACTT

-2 mil yrs: AAGGCTT, T_GACTT

-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT

today (unaligned): GGCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), ACCTT (Cat), ACACTTC (Lion)

today (with gaps shown): _G_GCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), A_C_CTT (Cat), A_CACTTC (Lion)

Page 3: BNFO 602 Lecture 2

Sequence alignments

They tell us about

• Function or activity of a new gene/protein

• Structure or shape of a new protein

• Location or preferred location of a protein

• Stability of a gene or protein

• Origin of a gene or protein

• Origin or phylogeny of an organelle

• Origin or phylogeny of an organism

• And more…

Page 4: BNFO 602 Lecture 2

Pairwise alignment

• X: ACA, Y: GACAT
• Match = 8, mismatch = 2, gap = -5

Alignment 1    Alignment 2    Alignment 3    Alignment 4
ACA--          -ACA-          --ACA          ACA----
GACAT          GACAT          GACAT          G--ACAT
2+2+2-5-5      -5+8+8+8-5     -5-5+2+2+2     2-5-5-5-5-5-5
Score = -4     Score = 14     Score = -4     Score = -28
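The column-by-column scoring used on this slide can be sketched in a few lines of Python. The function name `score_alignment` is hypothetical, and the defaults are the slide's match=8, mismatch=2, gap=-5; gaps are written as `-`.

```python
def score_alignment(top, bottom, match=8, mismatch=2, gap=-5):
    """Sum per-column scores of a gapped pairwise alignment (linear gap cost)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # one residue paired with a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total

# The four candidate alignments of X = ACA and Y = GACAT
for top, bottom in [("ACA--", "GACAT"), ("-ACA-", "GACAT"),
                    ("--ACA", "GACAT"), ("ACA----", "G--ACAT")]:
    print(top, bottom, score_alignment(top, bottom))
```

The second alignment, which lines up all three residues of X with the matching residues of Y, gets the highest score (14).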

Page 5: BNFO 602 Lecture 2

Optimal alignment

• An alignment can be specified by the traceback matrix.

• How do we determine the traceback for the highest scoring alignment?

• Needleman-Wunsch algorithm for global alignment
  – First proposed in 1970
  – Widely used in genomics/bioinformatics
  – Dynamic programming algorithm

Page 6: BNFO 602 Lecture 2

Needleman-Wunsch

• Input:
  – X = x1x2…xn, Y = y1y2…ym
  – (X is seq2 and Y is seq1)

• Define V to be a two-dimensional matrix with len(X)+1 rows and len(Y)+1 columns

• Let V[i][j] be the score of the optimal alignment of x1…xi and y1…yj.

• Let m be the match score, mm be the mismatch score, and g be the gap penalty. (Here m is a score; it is distinct from the subscript m used for the length of Y.)

Page 7: BNFO 602 Lecture 2

Dynamic programming

Initialization:
V[0][0] = 0
for i = 1 to len(seq2) { V[i][0] = i*g; T[i][0] = 'U' }
for j = 1 to len(seq1) { V[0][j] = j*g; T[0][j] = 'L' }

Recurrence:
for i = 1 to len(seq2) {
  for j = 1 to len(seq1) {
    s = m if seq2[i] equals seq1[j], otherwise mm
    V[i][j] = max { V[i-1][j-1] + s,
                    V[i-1][j] + g,
                    V[i][j-1] + g }
    if (maximum is V[i-1][j-1] + s) then T[i][j] = 'D'
    else if (maximum is V[i-1][j] + g) then T[i][j] = 'U'
    else T[i][j] = 'L'
  }
}
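A minimal Python sketch of the initialization and recurrence above. The function name `needleman_wunsch` and the default parameter values (m=5, mm=-4, g=-20) are assumptions chosen for illustration. Ties are broken in the order of the if/else chain (diagonal first, then up, then left); a different tie rule gives the same V but may give a different T.

```python
def needleman_wunsch(seq2, seq1, m=5, mm=-4, g=-20):
    """Fill score matrix V and traceback matrix T.

    seq2 indexes the rows and seq1 the columns, as in the slide.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    V = [[0] * cols for _ in range(rows)]
    T = [[""] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: seq2 prefix against gaps
        V[i][0], T[i][0] = i * g, "U"
    for j in range(1, cols):          # first row: seq1 prefix against gaps
        V[0][j], T[0][j] = j * g, "L"
    for i in range(1, rows):
        for j in range(1, cols):
            s = m if seq2[i - 1] == seq1[j - 1] else mm
            diag = V[i - 1][j - 1] + s
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            V[i][j] = max(diag, up, left)
            if V[i][j] == diag:
                T[i][j] = "D"
            elif V[i][j] == up:
                T[i][j] = "U"
            else:
                T[i][j] = "L"
    return V, T

V, T = needleman_wunsch("ACA", "GACAT")
for row in V:
    print(row)
```

The bottom-right entry of V is the score of the optimal global alignment.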

Page 8: BNFO 602 Lecture 2

Example

Input:
seq2: ACA
seq1: GACAT

m = 5, mm = -4, gap = -20

seq2 is lined along the rows and seq1 is along the columns.

V:
           G     A     C     A     T
      0   -20   -40   -60   -80  -100
A   -20    -4   -15   -35   -55   -75
C   -40   -24    -8   -10   -30   -50
A   -60   -44   -19   -12    -5   -25

T:
        G  A  C  A  T
        L  L  L  L  L
A    U  D  D  L  L  L
C    U  U  D  D  L  L
A    U  U  D  D  D  L
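To show how an alignment is read off the traceback matrix, here is a short Python sketch that follows the T matrix of this example from the bottom-right cell back to the top-left. The function name `traceback` is hypothetical, and the matrix is typed in by hand from the slide ('D' = diagonal, 'U' = up, 'L' = left).

```python
def traceback(T, seq2, seq1):
    """Recover the global alignment encoded by traceback matrix T.

    seq2 indexes the rows of T and seq1 the columns.
    """
    i, j = len(seq2), len(seq1)
    top, bottom = [], []
    while i > 0 or j > 0:
        move = T[i][j]
        if move == "D":              # align seq2[i-1] with seq1[j-1]
            top.append(seq2[i - 1])
            bottom.append(seq1[j - 1])
            i, j = i - 1, j - 1
        elif move == "U":            # seq2[i-1] against a gap
            top.append(seq2[i - 1])
            bottom.append("-")
            i -= 1
        else:                        # "L": seq1[j-1] against a gap
            top.append("-")
            bottom.append(seq1[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

T = [
    ["",  "L", "L", "L", "L", "L"],
    ["U", "D", "D", "L", "L", "L"],
    ["U", "U", "D", "D", "L", "L"],
    ["U", "U", "D", "D", "D", "L"],
]
print(traceback(T, "ACA", "GACAT"))  # ('-ACA-', 'GACAT')
```

The recovered alignment scores -20 + 5 + 5 + 5 - 20 = -25, which agrees with the bottom-right entry of V.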

Page 9: BNFO 602 Lecture 2

Affine gap penalties

• Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps)

• Alignment:
  ACACCCT        ACACCCC
  ACCT T         AC CTT
  Score = 0      Score = 0.9

• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1

Page 10: BNFO 602 Lecture 2

Affine penalty recurrence

$$V(i,j) = \max\{E(i,j),\ F(i,j),\ M(i,j)\}$$

$$M(i,j) = V(i-1,j-1) + s(x_i, y_j)$$

$$E(i,j) = \max\{E(i,j-1) + ext,\ V(i,j-1) + g\}$$

$$F(i,j) = \max\{F(i-1,j) + ext,\ V(i-1,j) + g\}$$

Here g is the gap open penalty and ext is the gap extension penalty. M(i,j) denotes alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) denotes alignments of x1..i and y1..j such that yj is paired with a gap. F(i,j) is defined similarly, with xi paired with a gap. The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
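The recurrence can be sketched in Python as follows. The function name `affine_score` and the boundary initialization (a leading gap of length k costs g + (k-1)*ext) are assumptions, since the slide does not state boundary conditions. The defaults are the trivial cost matrix from the affine gap penalties slide.

```python
NEG_INF = float("-inf")

def affine_score(x, y, match=1.0, mismatch=0.0, g=-2.0, ext=-0.1):
    """Optimal global alignment score under affine gap penalties."""
    n, m = len(x), len(y)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j paired with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i paired with a gap
    V[0][0] = 0.0
    for j in range(1, m + 1):        # assumed boundary: leading gap in x
        E[0][j] = g + (j - 1) * ext
        V[0][j] = E[0][j]
    for i in range(1, n + 1):        # assumed boundary: leading gap in y
        F[i][0] = g + (i - 1) * ext
        V[i][0] = F[i][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M = V[i - 1][j - 1] + s
            E[i][j] = max(E[i][j - 1] + ext, V[i][j - 1] + g)
            F[i][j] = max(F[i - 1][j] + ext, V[i - 1][j] + g)
            V[i][j] = max(E[i][j], F[i][j], M)
    return V[n][m]

print(affine_score("AAAA", "AA"))   # one gap of length 2: 2 matches - 2 - 0.1
```

Because extending a gap is cheap, one long gap is preferred over several short ones, which is the behaviour the affine model is meant to capture.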

Page 11: BNFO 602 Lecture 2

How do we pick gap parameters?

Page 12: BNFO 602 Lecture 2

Structural alignments

• Recall that proteins have 3-D structure.

Page 13: BNFO 602 Lecture 2

Structural alignment - example 1

Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.

PDB ids are 3TRX and 1XWC.

Page 14: BNFO 602 Lecture 2

Structural alignment - example 2

Computer generated aligned proteins

Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Page 15: BNFO 602 Lecture 2

Structural alignments

• We can produce high-quality alignments by hand if the structure is available.

• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Page 16: BNFO 602 Lecture 2

Benchmark alignments

• Protein alignment benchmarks
  – BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment.
  – Protein benchmarks are generally large and have been in use in the research community for some time now.
  – BAliBASE 3.0

Page 17: BNFO 602 Lecture 2

Biologically realistic scoring matrices

• PAM and BLOSUM are the most popular

• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

Page 18: BNFO 602 Lecture 2

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align (for PAM, 1572 mutations between 71 families)

• Compute probabilities of change and background probabilities by simple counting
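The "simple counting" step can be illustrated with a toy Python sketch. This is not the full Dayhoff procedure, which also involves phylogenetic trees and normalization to a fixed amount of change. The function name `estimate_probabilities` and the input (a list of already aligned sequence pairs, here nucleotides for brevity) are assumptions made for illustration.

```python
from collections import Counter

def estimate_probabilities(aligned_pairs):
    """Estimate substitution and background probabilities by counting columns."""
    pair_counts = Counter()      # symmetric counts of aligned residue pairs
    residue_counts = Counter()   # how often each residue is observed
    for a, b in aligned_pairs:
        for p, q in zip(a, b):
            if p == "-" or q == "-":
                continue         # gap columns carry no substitution information
            pair_counts[(p, q)] += 1
            pair_counts[(q, p)] += 1
            residue_counts[p] += 1
            residue_counts[q] += 1
    total = sum(residue_counts.values())
    background = {r: c / total for r, c in residue_counts.items()}
    # M[i][j]: among observations of residue i, fraction aligned with residue j
    M = {}
    for (p, q), c in pair_counts.items():
        M.setdefault(p, {})[q] = c / residue_counts[p]
    return M, background

M, background = estimate_probabilities([("AAGA", "AAGG")])
print(M)
print(background)
```

Each row of M sums to 1, so it can be read as a probability distribution over what residue i is aligned with.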

Page 19: BNFO 602 Lecture 2

Expected accuracy alignment

• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative.

• We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

Page 20: BNFO 602 Lecture 2

Posterior probability of xi aligned to yj

• Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.

• We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

Do et al., Genome Research, 2005

Page 21: BNFO 602 Lecture 2

Expected accuracy of alignment

• We can define the expected accuracy of an alignment a as

$$\frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$

• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + P(x_i \sim y_j) \\ V(i-1,j) \\ V(i,j-1) \end{cases}$$

Do et al., Genome Research, 2005
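A minimal Python sketch of this dynamic program, assuming the posterior probabilities are already available as a matrix P with P[i][j] = P(x_{i+1} ~ y_{j+1}) in 0-based indexing. The function name `max_expected_accuracy` and the toy matrices are hypothetical, and the function returns the unnormalized sum of posteriors along the best alignment.

```python
def max_expected_accuracy(P):
    """Maximum total posterior probability over all alignments.

    P is an n-by-m list of lists of posterior match probabilities.
    Gaps contribute nothing, so there is no gap penalty term.
    """
    n = len(P)
    m = len(P[0]) if n else 0
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + P[i - 1][j - 1],  # align x_i with y_j
                          V[i - 1][j],                        # x_i unaligned
                          V[i][j - 1])                        # y_j unaligned
    return V[n][m]

print(max_expected_accuracy([[0.9, 0.1], [0.2, 0.8]]))  # both diagonal pairs are kept
print(max_expected_accuracy([[0.1, 0.9], [0.9, 0.1]]))  # crossing pairs cannot both be kept
```

In the second toy matrix the two high-probability pairs cross each other, so only one of them can appear in a single alignment.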

Page 22: BNFO 602 Lecture 2

Example for expected accuracy

• True alignment
  AC_CG
  ACCCA
  Expected accuracy = (1+1+0+1+1)/4 = 1

• Estimated alignment
  ACC_G
  ACCCA
  Expected accuracy = (1+1+0.1+0+1)/4 = 0.775

Page 23: BNFO 602 Lecture 2

Estimating posterior probabilities

• If correct posterior probabilities can be computed then we can compute the correct alignment. Now it remains to estimate these probabilities from the data

• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)

• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices

Page 24: BNFO 602 Lecture 2

Example

Page 25: BNFO 602 Lecture 2

Local alignment

• Global alignment recurrence:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{cases}$$

• Local alignment recurrence:

$$V(i,j) = \max \begin{cases} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{cases}$$

Page 26: BNFO 602 Lecture 2

Local alignment traceback

• Let V(i,j) be the score matrix, T(i,j) be the traceback matrix, and m and n be the lengths of the input sequences.

• Global alignment traceback:
  – Begin from T(m,n) and stop at T(0,0).

• Local alignment traceback:
  – Find i*,j* such that V(i*,j*) is the maximum over all V(i,j).
  – Begin traceback from T(i*,j*) and stop when V(i,j) = 0.
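The local recurrence and its traceback can be sketched together in Python. The function name `local_align` and the default scores (match=5, mismatch=-4, gap=-20) are assumptions for illustration. The traceback starts at the highest-scoring cell and stops at the first cell whose score is 0.

```python
def local_align(x, y, match=5, mismatch=-4, gap=-20):
    """Best local alignment score and the two aligned substrings."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[""] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [(0, ""),                       # start a new alignment
                          (V[i - 1][j - 1] + s, "D"),
                          (V[i - 1][j] + gap, "U"),
                          (V[i][j - 1] + gap, "L")]
            V[i][j], T[i][j] = max(candidates, key=lambda c: c[0])
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j
    ax, ay = [], []
    i, j = bi, bj
    while V[i][j] > 0:                                   # border cells are 0
        if T[i][j] == "D":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif T[i][j] == "U":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(local_align("ACA", "GACAT"))  # (15, 'ACA', 'ACA')
```

Unlike the global case, the flanking G and T of GACAT are simply left out of the alignment instead of being charged as gaps.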

Page 27: BNFO 602 Lecture 2

BLAST

• Local pairwise alignment heuristic

• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.

• Online server: http://www.ncbi.nlm.nih.gov/blast

Page 28: BNFO 602 Lecture 2

BLAST

1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t --- also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.

2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold

3. Report maximal segments above score S.
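Step 2 can be illustrated with a simplified Python sketch that extends an exact k-mer hit to the right without gaps and stops once the score has dropped more than a threshold below the best score seen. The function name `extend_right`, the match/mismatch scores, and the threshold name `x_drop` are assumptions. Real BLAST also extends to the left and, for proteins, scores with a substitution matrix.

```python
def extend_right(q, t, qi, ti, k, match=1, mismatch=-1, x_drop=2):
    """Ungapped rightward extension of the hit q[qi:qi+k] == t[ti:ti+k].

    Returns (best score, length of the best-scoring segment).
    """
    score = k * match                 # score of the exact k-mer seed
    best, best_len = score, k
    i, j = qi + k, ti + k
    while i < len(q) and j < len(t):
        score += match if q[i] == t[j] else mismatch
        if score > best:
            best, best_len = score, i - qi + 1
        elif best - score > x_drop:   # score fell too far below the best: stop
            break
        i += 1
        j += 1
    return best, best_len

print(extend_right("ACGTACGT", "ACGTACGA", 0, 0, 3))  # (7, 7)
print(extend_right("AAACCCC", "AAAGGGG", 0, 0, 3))    # (3, 3)
```

In the second call the extension gives up after three consecutive mismatches and reports only the original seed.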

Page 29: BNFO 602 Lecture 2

Finding k-mers quickly

• Preprocess the database of sequences:
  – For each sequence in the database store all k-mers in a hash table.
  – This takes linear time

• Query sequence:
  – For each k-mer in the query sequence look up the hash table of the target to see if it exists
  – Also takes linear time
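The two steps above can be sketched with a Python dictionary acting as the hash table. The function names `build_kmer_index` and `find_hits` and the toy database are hypothetical. Only exact k-mer matches are looked up here; the score threshold t is not modelled.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def find_hits(query, index, k):
    """Return (query position, sequence id, target position) for each shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for seq_id, tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, tpos))
    return hits

database = {"s1": "GACAT", "s2": "TTACA"}
index = build_kmer_index(database, 3)
print(find_hits("ACA", index, 3))  # [(0, 's1', 1), (0, 's2', 2)]
```

Building the index visits each database position once, and each query k-mer costs one hash lookup, which matches the linear-time claims on the slide.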