bnfo 602 lecture 2

BNFO 602 Lecture 2

Usman Roshan

DNA Sequence Evolution

[Figure: an ancestral sequence evolving along a tree over time; _ marks a gap]

-3 mil yrs: AAGACTT

-2 mil yrs: T_GACTT, AAGGCTT

-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT

today: ACCTT (Cat), ACACTTC (Lion), TAGCCCTTA (Monkey), TAGGCCTT (Human), GGCTT (Mouse)

[Figure: the same tree with the present-day sequences written with gaps (_)]

TAGGCCTT (Human)

TAGCCCTTA (Monkey)

A_C_CTT (Cat)

A_CACTTC (Lion)

_G_GCTT (Mouse)

Ancestral sequences: AAGACTT, AAGGCTT, T_GACTT

Sequence alignments

They tell us about

• Function or activity of a new gene/protein

• Structure or shape of a new protein

• Location or preferred location of a protein

• Stability of a gene or protein

• Origin of a gene or protein

• Origin or phylogeny of an organelle

• Origin or phylogeny of an organism

• And more…

Pairwise alignment

• X: ACA, Y: GACAT

• Match = 8, mismatch = 2, gap = -5

Alignment 1    Alignment 2    Alignment 3    Alignment 4
ACA--          -ACA-          --ACA          ACA----
GACAT          GACAT          GACAT          G--ACAT
2+2+2-5-5      -5+8+8+8-5     -5-5+2+2+2     2-5-5-5-5-5-5
Score = -4     Score = 14     Score = -4     Score = -28
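The column-by-column scoring used for the alignments of ACA and GACAT can be sketched as below. The function name `score_alignment` and the use of `-` as the gap character are assumptions made for illustration; the default parameters are the match, mismatch and gap values from this slide.

```python
def score_alignment(row_x, row_y, match=8, mismatch=2, gap=-5):
    """Sum the column scores of a gapped pairwise alignment ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # one sequence has a gap in this column
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


print(score_alignment("-ACA-", "GACAT"))      # 14
print(score_alignment("ACA----", "G--ACAT"))  # -28
```

Every column contributes exactly one term, so an alignment with more columns (more gaps) pays more gap penalties.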

Optimal alignment

• An alignment can be specified by the traceback matrix.

• How do we determine the traceback for the highest scoring alignment?

• Needleman-Wunsch algorithm for global alignment
  – First proposed in 1970
  – Widely used in genomics/bioinformatics
  – Dynamic programming algorithm

Needleman-Wunsch

• Input:
  – X = x1x2…xn, Y = y1y2…ym
  – (X is seq2 and Y is seq1)

• Define V to be a two-dimensional matrix with len(X)+1 rows and len(Y)+1 columns

• Let V[i][j] be the score of the optimal alignment of x1…xi and y1…yj.

• Let m be the match score, mm the mismatch score, and g the gap penalty. (As a scoring parameter, m is the match score and not the length of Y.)

Dynamic programming

Initialization:

V[0][0] = 0;
for i = 1 to len(seq2) { V[i][0] = i*g; T[i][0] = 'U'; }
for j = 1 to len(seq1) { V[0][j] = j*g; T[0][j] = 'L'; }

Recurrence:

for i = 1 to len(seq2) {
  for j = 1 to len(seq1) {
    s = m if seq2[i] equals seq1[j], otherwise mm
    V[i][j] = max { V[i-1][j-1] + s,
                    V[i-1][j] + g,
                    V[i][j-1] + g }
    if (maximum is V[i-1][j-1] + s) then T[i][j] = 'D'
    else if (maximum is V[i-1][j] + g) then T[i][j] = 'U'
    else T[i][j] = 'L'
  }
}
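A minimal runnable sketch of the fill and traceback steps, assuming seq2 indexes the rows and seq1 the columns as in the lecture. The function name, the D-then-U-then-L tie-breaking order and the default parameters (m = 5, mm = -4, g = -20) are choices made for this illustration.

```python
def needleman_wunsch(seq1, seq2, m=5, mm=-4, g=-20):
    """Global alignment; seq2 indexes rows and seq1 indexes columns."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    V = [[0] * cols for _ in range(rows)]
    T = [[""] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: seq2 aligned to gaps
        V[i][0], T[i][0] = i * g, "U"
    for j in range(1, cols):          # first row: seq1 aligned to gaps
        V[0][j], T[0][j] = j * g, "L"
    for i in range(1, rows):
        for j in range(1, cols):
            s = m if seq2[i - 1] == seq1[j - 1] else mm
            diag = V[i - 1][j - 1] + s
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            V[i][j] = max(diag, up, left)
            # Assumption: ties are broken in the order D, U, L.
            if V[i][j] == diag:
                T[i][j] = "D"
            elif V[i][j] == up:
                T[i][j] = "U"
            else:
                T[i][j] = "L"
    # Traceback from the bottom-right cell back to the origin.
    a1, a2 = [], []
    i, j = len(seq2), len(seq1)
    while i > 0 or j > 0:
        if T[i][j] == "D":
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1]); i -= 1; j -= 1
        elif T[i][j] == "U":
            a1.append("-"); a2.append(seq2[i - 1]); i -= 1
        else:
            a1.append(seq1[j - 1]); a2.append("-"); j -= 1
    return V, "".join(reversed(a1)), "".join(reversed(a2))


V, top, bottom = needleman_wunsch("GACAT", "ACA")
print(V[-1][-1])  # -25
print(top)        # GACAT
print(bottom)     # -ACA-
```

A different tie-breaking order can change individual entries of T without changing V or the optimal score.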

Example

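Input:
seq2: ACA
seq1: GACAT

m = 5
mm = -4
gap = -20

seq2 is lined along the rows and seq1 is along the columns

V

      -     G     A     C     A     T
-     0   -20   -40   -60   -80  -100
A   -20    -4   -15   -35   -55   -75
C   -40   -24    -8   -10   -30   -50
A   -60   -44   -19   -12    -5   -25

T

      -     G     A     C     A     T
-           L     L     L     L     L
A     U     D     D     L     L     L
C     U     U     D     D     L     L
A     U     U     D     D     D     L

Where two moves tie (for example in row 1, column 4, where the diagonal and left moves both give -55), either traceback letter is valid; T shows one choice.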

Affine gap penalties

• Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps)

• Alignment:
  – ACACCCT      ACACCCC
  – AC-CT-T      AC--CTT
  – Score = 0    Score = 0.9

• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
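The two scores can be checked with a small scoring routine. This is a sketch: the function name `affine_score` is hypothetical, and the gap placements (AC-CT-T and AC--CTT for the second row) are an assumption, chosen because they reproduce the stated scores of 0 and 0.9.

```python
def affine_score(row_x, row_y, match=1.0, mismatch=0.0,
                 gap_open=-2.0, gap_ext=-0.1):
    """Score a gapped alignment under affine gap penalties ('-' marks a gap)."""
    total = 0.0
    prev_gap = None  # which row held the gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            # The first column of a gap run pays gap_open, later ones gap_ext.
            total += gap_ext if side == prev_gap else gap_open
            prev_gap = side
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


print(round(affine_score("ACACCCT", "AC-CT-T"), 6))  # 0.0
print(round(affine_score("ACACCCC", "AC--CTT"), 6))  # 0.9
```

Two separate one-column gaps cost 2 × (-2), while one two-column gap costs -2 + (-0.1), which is why the second alignment scores higher despite having fewer matches.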

Affine penalty recurrence

$$V(i,j) = \max\{E(i,j),\ F(i,j),\ M(i,j)\}$$

$$M(i,j) = V(i-1,j-1) + s(x_i, y_j)$$

$$E(i,j) = \max\{E(i,j-1) + \mathit{ext},\ V(i,j-1) + g\}$$

$$F(i,j) = \max\{F(i-1,j) + \mathit{ext},\ V(i-1,j) + g\}$$

M(i,j) is the best score over alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) is the best score over alignments of x1..i and y1..j in which yj is paired with a gap. F(i,j) is defined similarly, with xi paired with a gap. Here g is the gap open penalty and ext is the gap extension penalty (e above). The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
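The recurrence can be sketched in code as follows. The boundary values are an assumption not stated on the slide (a leading gap of length k costs g + (k-1)·ext), as are the function name and the simple match/mismatch scoring in place of a general s(xi, yj).

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1.0, mismatch=0.0, g=-2.0, ext=-0.1):
    """Optimal global alignment score under affine gap penalties."""
    n, m = len(x), len(y)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j paired with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i paired with a gap
    V[0][0] = 0.0
    # Assumed boundary: a leading gap of length k costs g + (k - 1) * ext.
    for j in range(1, m + 1):
        E[0][j] = g + (j - 1) * ext
        V[0][j] = E[0][j]
    for i in range(1, n + 1):
        F[i][0] = g + (i - 1) * ext
        V[i][0] = F[i][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M = V[i - 1][j - 1] + s
            E[i][j] = max(E[i][j - 1] + ext, V[i][j - 1] + g)
            F[i][j] = max(F[i - 1][j] + ext, V[i - 1][j] + g)
            V[i][j] = max(E[i][j], F[i][j], M)
    return V[n][m]


print(round(affine_global_score("ACACCCC", "ACCTT"), 6))  # 0.9
```

Each cell is filled with a constant number of operations, which gives the O(mn) running time.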

How do we pick gap parameters?

Structural alignments

• Recall that proteins have 3-D structure.

Structural alignment - example 1

Alignment of thioredoxins from human and fly, taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.

PDB ids are 3TRX and 1XWC.

Structural alignment - example 2

Computer generated aligned proteins

Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Structural alignments

• We can produce high-quality alignments by hand if the structure is available.

• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments

• Protein alignment benchmarks
  – BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment.
  – Protein benchmarks are generally large and have been used in the research community for some time.
  – BAliBASE 3.0: http://www-bio3d-igbmc.u-strasbg.fr/balibase/

Biologically realistic scoring matrices

• PAM and BLOSUM are the most popular

• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families

• Compute probabilities of change and background probabilities by simple counting
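The "simple counting" step can be illustrated with a toy estimator. This is a simplified sketch and not the full PAM construction (it omits, for example, scaling the matrix to a fixed amount of change); the function name, the input format and the decision to count each aligned pair in both directions are assumptions.

```python
from collections import Counter


def estimate_transition_probs(aligned_pairs):
    """Estimate M[(a, b)] = P(a -> b) and background frequencies by counting."""
    pair_counts = Counter()
    residue_counts = Counter()
    for s, t in aligned_pairs:
        for a, b in zip(s, t):
            if a == "-" or b == "-":
                continue  # skip gapped columns
            # Assumption: the direction of change is unknown, so count both.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total = sum(residue_counts.values())
    background = {a: c / total for a, c in residue_counts.items()}
    M = {(a, b): c / residue_counts[a] for (a, b), c in pair_counts.items()}
    return M, background


M, background = estimate_transition_probs([("AAD", "AED")])
print(M[("A", "E")], background["A"])
```

With this counting scheme each row of M sums to 1, because every occurrence of residue a contributes exactly one count to some pair (a, b).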

Expected accuracy alignment

• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative.

• We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

Posterior probability of xi aligned to yj

• Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.

• We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

Do et al., Genome Research, 2005

Expected accuracy of alignment

• We can define the expected accuracy of an alignment a as

$$\frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$

• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm

$$V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + P(x_i \sim y_j) \\ V(i-1,j) \\ V(i,j-1) \end{array} \right\}$$

Do et al., Genome Research, 2005
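A minimal sketch of the maximum expected accuracy dynamic program, assuming the posterior probabilities are already given as a matrix P with P[i][j] standing for P(x_{i+1} ~ y_{j+1}). The function name and the division by min(|x|, |y|) at the end are assumptions made for this illustration.

```python
def max_expected_accuracy(P):
    """Return (expected accuracy, aligned pairs) for a posterior matrix P.

    P[i][j] is the posterior that residue i+1 of x aligns with residue j+1 of y.
    """
    n = len(P)
    m = len(P[0]) if n else 0
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + P[i - 1][j - 1],
                          V[i - 1][j],
                          V[i][j - 1])
    # Traceback: recover which residue pairs were aligned (1-based indices).
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if V[i][j] == V[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i, j)); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    shorter = min(n, m)
    accuracy = V[n][m] / shorter if shorter else 0.0
    return accuracy, pairs


P = [[0.9, 0.1, 0.0],
     [0.1, 0.2, 0.7]]
print(max_expected_accuracy(P))
```

Unlike the score-based recurrence there is no gap penalty here: unaligned residues simply contribute nothing to the sum.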

Example for expected accuracy

• True alignment
  – AC_CG
  – ACCCA
  – Expected accuracy = (1+1+0+1+1)/4 = 1

• Estimated alignment
  – ACC_G
  – ACCCA
  – Expected accuracy = (1+1+0.1+0+1)/4 = 0.775
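The arithmetic in this example can be reproduced with a few lines. This sketch assumes the per-column posterior values listed on the slide and assumes that the divisor 4 is the length of the shorter sequence (ACCG); the function name is hypothetical.

```python
def expected_accuracy(column_posteriors, len_x, len_y):
    """Sum of per-column posteriors divided by the shorter sequence length."""
    return sum(column_posteriors) / min(len_x, len_y)


# Sequences ACCG (length 4) and ACCCA (length 5).
print(expected_accuracy([1, 1, 0, 1, 1], 4, 5))    # 1.0
print(expected_accuracy([1, 1, 0.1, 0, 1], 4, 5))  # approximately 0.775
```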

Estimating posterior probabilities

• If correct posterior probabilities can be computed then we can compute the correct alignment. It remains to estimate these probabilities from the data.

• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)

• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices
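To give a feel for the partition function approach, here is a simplified sketch. It is not the Probalign implementation: it assumes a linear gap penalty, a plain match/mismatch score, and a parameter named `temperature` that controls how sharply higher-scoring alignments are favored. All names and default values are assumptions.

```python
import math


def posterior_matrix(x, y, match=1.0, mismatch=-1.0, gap=-2.0, temperature=1.0):
    """Posterior P[i][j] that x[i] aligns with y[j], from partition functions."""
    n, m = len(x), len(y)

    def w(a, b):  # Boltzmann weight of aligning characters a and b
        return math.exp((match if a == b else mismatch) / temperature)

    wg = math.exp(gap / temperature)  # Boltzmann weight of one gap column

    # Zf[i][j]: sum of weights over all alignments of prefixes x[:i], y[:j].
    Zf = [[0.0] * (m + 1) for _ in range(n + 1)]
    Zf[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                Zf[i][j] += Zf[i - 1][j - 1] * w(x[i - 1], y[j - 1])
            if i > 0:
                Zf[i][j] += Zf[i - 1][j] * wg
            if j > 0:
                Zf[i][j] += Zf[i][j - 1] * wg

    # Zr[i][j]: the same sum over all alignments of suffixes x[i:], y[j:].
    Zr = [[0.0] * (m + 1) for _ in range(n + 1)]
    Zr[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                Zr[i][j] += Zr[i + 1][j + 1] * w(x[i], y[j])
            if i < n:
                Zr[i][j] += Zr[i + 1][j] * wg
            if j < m:
                Zr[i][j] += Zr[i][j + 1] * wg

    Z = Zf[n][m]
    # Weight of all alignments that pair x[i] with y[j], divided by the total.
    P = [[Zf[i][j] * w(x[i], y[j]) * Zr[i + 1][j + 1] / Z for j in range(m)]
         for i in range(n)]
    return P, Z


P, Z = posterior_matrix("ACG", "AG")
for row in P:
    print([round(p, 3) for p in row])
```

The forward matrix has the same shape as the score matrix V, with max replaced by a sum and scores replaced by exponentiated scores.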

Example

Local alignment

• Global alignment recurrence:

$$V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}$$

• Local alignment recurrence:

$$V(i,j) = \max \left\{ \begin{array}{l} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}$$

Local alignment traceback

• Let V(i,j) be the score matrix, T(i,j) the traceback matrix, and m and n the lengths of the input sequences.

• Global alignment traceback:
  – Begin from T(m,n) and stop at T(0,0).

• Local alignment traceback:
  – Find i*,j* such that V(i*,j*) is the maximum over all V(i,j).
  – Begin traceback from T(i*,j*) and stop when V(i,j) = 0.
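A compact sketch of local alignment with traceback. It starts from the highest-scoring cell of the score matrix and stops at the first cell with score 0. The function name and the scoring values (match = 2, mismatch = -1, g = -2) are assumptions, and the traceback recomputes the moves from the score matrix instead of storing a separate matrix T.

```python
def local_align(x, y, match=2, mismatch=-1, g=-2):
    """Return (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(0, V[i - 1][j - 1] + s, V[i - 1][j] + g, V[i][j - 1] + g)
            if V[i][j] > best:          # remember the highest-scoring cell
                best, bi, bj = V[i][j], i, j
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:   # stop once the score reaches 0
        s = match if x[i - 1] == y[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("GGACGTCC", "TTACGTAA"))  # (8, 'ACGT', 'ACGT')
```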

BLAST

• Local pairwise alignment heuristic

• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.

• Online server: http://www.ncbi.nlm.nih.gov/blast


BLAST

1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t --- also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.

2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold

3. Report maximal segments above score S.
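Step 2 can be illustrated with an ungapped extension that stops once the running score has dropped too far below the best score seen. This is a simplified sketch and not BLAST's actual code: it assumes an exact-match seed, nucleotide-style match/mismatch scores, and a hypothetical `drop` threshold.

```python
def extend_hit(q, t, qi, ti, k, match=5, mismatch=-4, drop=10):
    """Extend the k-mer hit q[qi:qi+k] / t[ti:ti+k] in both directions.

    Returns (segment score, query start, query end), with end exclusive.
    """
    def walk(qpos, tpos, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qpos < len(q) and 0 <= tpos < len(t):
            run += match if q[qpos] == t[tpos] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length   # new locally maximal extension
            elif best - run > drop:
                break                          # score fell too far: stop
            qpos += step
            tpos += step
        return best, best_len

    seed = sum(match if q[qi + d] == t[ti + d] else mismatch for d in range(k))
    right, rlen = walk(qi + k, ti + k, +1)
    left, llen = walk(qi - 1, ti - 1, -1)
    return seed + left + right, qi - llen, qi + k + rlen


print(extend_hit("TTACGTACGG", "CCACGTACAA", 2, 2, 3))  # (30, 2, 8)
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches examined before the cutoff do not lower the reported score.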

Finding k-mers quickly

• Preprocess the database of sequences:
  – For each sequence in the database store all k-mers in a hash table.
  – This takes linear time

• Query sequence:
  – For each k-mer in the query sequence look up the hash table of the target to see if it exists
  – Also takes linear time
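The preprocessing and lookup steps can be sketched with a Python dictionary as the hash table. The function names and the toy database are hypothetical, and only exact k-mer matches are looked up here.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_hits(query, index, k):
    """Return (query position, sequence name, target position) for each shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, name, tpos))
    return hits


database = {"s1": "GACAT", "s2": "TTACA"}
index = build_kmer_index(database, 3)
print(find_hits("ACA", index, 3))  # [(0, 's1', 1), (0, 's2', 2)]
```

For fixed k, building the index visits each database position once and each query k-mer costs one expected constant-time lookup, plus the time to report its hits.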
