Centre for Integrative Bioinformatics VU

Sequence Analysis, 06-11-2006

Alignments 2: Local alignment


What can sequence alignment tell us about structure

HSSP Sander & Schneider, 1991

≥30% sequence identity


Pair-wise alignment: complexity of the problem

Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \sim \frac{2^{2n}}{\sqrt{\pi n}}$$

2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!

T D W V T A L K
T D W L - - I K
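The size of this count can be checked directly. The sketch below is only an illustration (the function names are made up here): it evaluates the central binomial coefficient exactly with Python's big integers and compares it with the 2^(2n)/sqrt(pi n) approximation.

```python
import math

def count_alignments(n: int) -> int:
    """Exact central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)

def approx_log10(n: int) -> float:
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)

for n in (300, 1000):
    # Number of decimal digits minus one is floor(log10) of the exact count.
    exact_exponent = len(str(count_alignments(n))) - 1
    print(n, exact_exponent, round(approx_log10(n), 1))
```

For n = 1000 the exact count is a 601-digit number, which is why exhaustive enumeration is hopeless and dynamic programming is used instead.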


The algorithm for linear gap penalties

$$M[i,j] = \max \begin{cases} M[i-1,j-1] + \mathrm{score}(X[i],Y[j]) \\ M[i,j-1] - g \\ M[i-1,j] - g \end{cases}$$

score(X[i],Y[j]) is the value from the residue exchange matrix.

The three cases correspond to:

X1…Xi-1 Xi
Y1…Yj-1 Yj

X1…Xi -
Y1…Yj-1 Yj

X1…Xi-1 Xi
Y1…Yj-1 -
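A minimal sketch of this recurrence with trace-back is given below. The scoring function `identity` (1 for a match, -1 for a mismatch), the gap penalty g = 2 and the boundary values -g*i are assumptions chosen for the illustration; they are not taken from the slides.

```python
def identity(a: str, b: str) -> int:
    """Hypothetical scoring function: +1 match, -1 mismatch."""
    return 1 if a == b else -1

def needleman_wunsch(x, y, score=identity, g=2):
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # End gaps are penalised like any other gap.
    for i in range(1, n + 1):
        M[i][0] = -g * i
    for j in range(1, m + 1):
        M[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # X[i] with Y[j]
                M[i][j - 1] - g,                              # gap in X
                M[i - 1][j] - g,                              # gap in Y
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[n][m], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("ACGT", "AGT"))
```

When several alignments share the optimal score, this trace-back returns only one of them (it prefers the diagonal move).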


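Dynamic programming: scoring alignments

– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty (k is the gap length; α and β are positive constants)
• Linear: gp(k) = αk
• Affine: gp(k) = β + αk
• Concave, e.g.: gp(k) = log(k)

The score for an alignment is the sum of the scores over all alignment columns.

[Figure: penalty plotted against gap length for linear, affine and concave gap penalties]

General alignment score:

$$S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k)$$

where l runs over the aligned residue pairs and N_k is the number of gaps of length k.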
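The general alignment score can be sketched as follows: substitution scores are summed over aligned residue pairs, and one penalty is subtracted per gap (run of '-'). The constants (alpha = 2, beta = 10) and the toy substitution function are assumptions for the illustration only.

```python
import math
from itertools import groupby

def linear(k, alpha=2.0):
    return alpha * k

def affine(k, beta=10.0, alpha=2.0):
    return beta + alpha * k

def concave(k):
    return math.log(k)

def alignment_score(a, b, s, gp):
    """Score two aligned rows a and b (equal length, '-' marks a gap)."""
    total = 0.0
    # Substitution part: every column that pairs two residues.
    for p, q in zip(a, b):
        if p != "-" and q != "-":
            total += s(p, q)
    # Gap part: each maximal run of '-' in either row is one gap of length k.
    for row in (a, b):
        for ch, run in groupby(row):
            if ch == "-":
                total -= gp(len(list(run)))
    return total

def toy(p, q):
    """Hypothetical substitution score: 1 for identical residues, else 0."""
    return 1 if p == q else 0

print(alignment_score("TDWVTALK", "TDWL--IK", toy, affine))
```

With the same alignment, swapping `affine` for `linear` or `concave` changes only how the single gap of length 2 is charged.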


Scoring and gap penalties

• Scoring:
– DNA or protein?
– Which evolutionary distance do we cover?

• Gap penalty: its contribution is always less than 0 (i.e. gaps lower the alignment score); k is the gap length, α and β are positive constants
– Linear: g(k) = αk
– Affine: g(k) = β + αk (harder to implement)
– Concave: g(k) = log(k) (no efficient solution)


Global dynamic programming – general algorithm

$$M[i,j] = \mathrm{score}(X[i],Y[j]) + \max \begin{cases} M[i-1,j-1] \\ \max_{0<x<i-1} \{ M[x,j-1] - g_{open} - (i-x-1)\,g_{extension} \} \\ \max_{0<y<j-1} \{ M[i-1,y] - g_{open} - (j-y-1)\,g_{extension} \} \end{cases}$$

• score(X[i],Y[j]): value from the residue exchange matrix
• g_open: gap open penalty
• g_extension: gap extension penalty
• (i-x-1) and (j-y-1): number of gap extensions

This more general way of dynamic programming also allows for affine or other gap penalty regimes.


Easy DP recipe for using affine gap penalties

• M[i,j] is the optimal alignment score (highest scoring alignment until [i,j])
• Check
– cell[i-1, j-1]: apply score for cell[i-1, j-1]
– preceding row until j-2: apply appropriate gap penalties
– preceding column until i-2: apply appropriate gap penalties


Note about gap penalties

• Some affine schemes use

gap_penalty = -gopen - gextension*(l-1),

while others use

gap_penalty = -gopen - gextension*l,

where l is the length of the gap.

One can be converted into the other by adapting the gopen penalty: the second scheme with a given gopen equals the first scheme with gopen + gextension.
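The conversion between the two conventions can be checked numerically. The function names below are made up for this sketch; the penalties are returned as negative contributions, as in the formulas above.

```python
def penalty_l_minus_1(l, gopen, gext):
    """Scheme 1: -gopen - gext*(l-1)."""
    return -gopen - gext * (l - 1)

def penalty_l(l, gopen, gext):
    """Scheme 2: -gopen - gext*l."""
    return -gopen - gext * l

def scheme1_gopen_from_scheme2(gopen2, gext):
    """gopen for scheme 1 that reproduces scheme 2 with gopen2."""
    return gopen2 + gext

# gopen=10, gext=2 under scheme 2 matches gopen=12, gext=2 under scheme 1.
g1 = scheme1_gopen_from_scheme2(10, 2)
for l in range(1, 6):
    print(l, penalty_l(l, 10, 2), penalty_l_minus_1(l, g1, 2))
```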


Global dynamic programming

gopen=10, gextension=2

Search matrix:

            D    W    V    T    A    L    K  end
       0  -12  -14  -16  -18  -20  -22  -24
T    -12    8   -9   -6   -5   -9  -11  -14  -34
D    -14    0    9    2    2    3   -5   -3  -21
W    -16  -13   25   11    5    4    9    0  -16
V    -18  -10   -4   37   21   19   19   15    1
L    -20  -14   -2   23   46   31   37   26   14
K    -22  -12   -9   17   33   53   39   50
end       -34  -29   -1   17   39   27   50

The extra bottom row and rightmost column (end) give the final global alignment scores.

Exchange values:

       D    W    V    T    A    L    K
T      8    3    8   11    9    9    8
D     12    1    6    8    8    4    8
W      1   25    2    3    2    6    5
V      6    2   12    8    8   10    6
L      4    6   10    9    6   14    5
K      8    5    6    8    7    5   13

These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
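The general recurrence can be sketched as below and applied to this example. The gap penalty for a gap of length k is taken as gopen + k*gextension, and the first row and column hold end-gap penalties, matching the 0, -12, -14, ... border of the search matrix. The names `global_dp`, `exchange`, `COLS` and `EXCH` are made up for the sketch, and jumps are allowed to start from the border row/column (x = 0 or y = 0), which is an assumption about the intended bounds.

```python
COLS = "DWVTALK"
# Shifted PAM250 values (PAM250 + 8) for the residue pairs in this example.
EXCH = {
    "T": [8, 3, 8, 11, 9, 9, 8],
    "D": [12, 1, 6, 8, 8, 4, 8],
    "W": [1, 25, 2, 3, 2, 6, 5],
    "V": [6, 2, 12, 8, 8, 10, 6],
    "L": [4, 6, 10, 9, 6, 14, 5],
    "K": [8, 5, 6, 8, 7, 5, 13],
}

def exchange(a, b):
    """Look up the exchange value for row residue a and column residue b."""
    return EXCH[a][COLS.index(b)]

def global_dp(x, y, score, gopen, gext):
    n, m = len(x), len(y)

    def gap(k):
        return gopen + k * gext

    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -gap(i)
    for j in range(1, m + 1):
        M[0][j] = -gap(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = M[i - 1][j - 1]
            # Jump from the preceding column: (i - xx - 1) residues of x skipped.
            for xx in range(0, i - 1):
                best = max(best, M[xx][j - 1] - gap(i - xx - 1))
            # Jump from the preceding row: (j - yy - 1) residues of y skipped.
            for yy in range(0, j - 1):
                best = max(best, M[i - 1][yy] - gap(j - yy - 1))
            M[i][j] = score(x[i - 1], y[j - 1]) + best
    return M

M = global_dp("TDWVLK", "DWVTALK", exchange, 10, 2)
print(M[6][7])
```

The inner loops make each cell cost O(n + m), so this general form is slower than the linear-gap recurrence; that is the price for supporting arbitrary gap penalty functions.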


Variation on global alignment

• Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences.

CAGCACTTGGATTCTCGG
CAGC-----G-T----GG

• Semi-global alignment: don’t penalize for start/end gaps (omit the start/end of sequences).

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

– Applications of semi-global:
– Finding a gene in a genome
– Placing a marker onto a chromosome
– One sequence much longer than the other

– Danger! Really bad alignments for divergent sequences


Global alignments - review

• Take two sequences: X[i] and Y[j]

• The best alignment for X[1…i] and Y[1…j] is called M[i, j]

• Initiation: M[0,0]=0

• Apply the equation

• Find the alignment with backtracking

[Figure: DP matrix for two short DNA sequences, with rows and columns numbered from 0]


Global and local alignment

[Figure: two schematic problems comparing global and local alignment of sequences built from segments A, B and C that occur in different orders]


Local alignment

• What’s local?
– Allow only parts of the sequence to match
– Results in High Scoring Segments
– Locally maximal: cannot make it better by trimming/extending the alignment


Local alignment

• Why local?

– Parts of the sequence diverge faster: evolutionary pressure does not put constraints on the whole sequence

– Proteins have modular construction: sharing domains between sequences


Domains - example

Immunoglobulin domain


Global → local alignment

• Take the old, good equation

• Look at the result of the global alignment

CAGCACTTGGATTCTCG-
CA-C-----GATTCGT-G

a) global align

b) retrieve the result

c) sum score along the result

[Figure: DP matrix and a plot of the summed score against alignment position]


Local alignment – breaking the alignment

• A recipe

– Just don’t let the score go below 0

– Start a new alignment when that happens

– Where is the result in the matrix?

[Figure: DP matrices and plots of the summed score against alignment position, before and after resetting negative scores to 0]


Local alignment – the equation

• Init the matrix with 0’s

$$M[i,j] = \max \begin{cases} M[i,j-1] - g \\ M[i-1,j] - g \\ M[i-1,j-1] + \mathrm{score}(X[i],Y[j]) \\ 0 \end{cases}$$

The extra 0 in the maximum: great contribution to science!

• Read the maximal value from anywhere in the matrix

• Find the result with backtracking

[Figure: DP matrix for two short example sequences]
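A minimal sketch of the local recurrence with linear gap penalty follows. The scoring function (2 for a match, -1 for a mismatch), g = 2 and the example sequences are assumptions for the illustration, not values from the slides.

```python
def match2(a: str, b: str) -> int:
    """Hypothetical scoring function: +2 match, -1 mismatch."""
    return 2 if a == b else -1

def smith_waterman(x, y, score=match2, g=2):
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                0,                                            # restart here
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                M[i][j - 1] - g,
                M[i - 1][j] - g,
            )
            if M[i][j] > best:                                # max anywhere
                best, bi, bj = M[i][j], i, j
    # Trace back from the best cell until a zero-valued cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("CCCGATTACATTT", "AAGATTACAGG"))
```

Compared with the global version, the only changes are the zero border, the extra 0 in the maximum, and starting the trace-back at the highest cell instead of the bottom-right corner.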


Finding second best alignment

• We can find the best local alignment in the sequence

• But where is the second best one?

[Figure: fragment of the local alignment matrix around the segment …AGA…, with the best alignment and the clump of high scores around it marked]

Scoring: 1 for a match, -2 for a gap


Clump of an alignment

• Alignments sharing at least one aligned pair

[Figure: a clump in the matrix fragment around the segment …AGA…]


Example: repeat proteins

Local alignment traces showing similarity between repeats

Note: repeats are similar motifs but need not be identical


Clumps

[Figure: alignment matrix with gene X and gene Y along the axes]

The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)

‘Shadows’ of various alignments are visible – these all derive from parts of the same locally optimal alignment


Finding second best alignment

• Don’t let any matched pair contribute to the next alignment

– “Clear” the best alignment

– Recalculate the clump

[Figure: the matrix fragment around the segment …AGA… after clearing the best alignment and recalculating the clump]


Clumps

[Figure: alignment matrix with gene X and gene Y along the axes]

The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)

The highest scoring local alignment is indicated


Clumps

[Figure: alignment matrix with gene X and gene Y along the axes]


Extraction of alignments – Waterman-Eggert algorithm

1. Repeat

a. Retrieve the highest scoring alignment

b. Set its trace to 0
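The repeat loop can be sketched as follows. This is a simplified illustration rather than the published algorithm: it recomputes the whole matrix in every round instead of only the affected clump, it uses a linear gap penalty, and the names (`local_matrix`, `trace`, `waterman_eggert`, `match2`) and parameters are made up for the example.

```python
def match2(a: str, b: str) -> int:
    """Hypothetical scoring function: +2 match, -1 mismatch."""
    return 2 if a == b else -1

def local_matrix(x, y, score, g, cleared):
    """Local DP matrix in which every cell in `cleared` is forced to 0."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in cleared:
                continue  # a cleared cell may not contribute again
            M[i][j] = max(0,
                          M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          M[i][j - 1] - g,
                          M[i - 1][j] - g)
    return M

def trace(M, x, y, score, g, i, j):
    """Cells on the trace-back path from (i, j) down to the first zero cell."""
    cells = []
    while i > 0 and j > 0 and M[i][j] > 0:
        cells.append((i, j))
        if M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - g:
            i -= 1
        else:
            j -= 1
    return cells

def waterman_eggert(x, y, score=match2, g=2, k=3):
    """Return up to k non-intersecting local alignments as (score, end cell)."""
    cleared, found = set(), []
    for _ in range(k):
        M = local_matrix(x, y, score, g, cleared)
        best, bi, bj = 0, 0, 0
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                if M[i][j] > best:
                    best, bi, bj = M[i][j], i, j
        if best == 0:
            break
        cleared.update(trace(M, x, y, score, g, bi, bj))  # set its trace to 0
        found.append((best, (bi, bj)))
    return found

print(waterman_eggert("ACGTACGT", "ACGT"))
```

For the tandem repeat in the usage line, the two copies of ACGT are reported as separate alignments, after which nothing with a positive score remains.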


General local DP algorithm for using affine gap penalties

M[i,j] is the optimal alignment score (highest scoring alignment until [i, j]).

At each cell [i, j] in the search matrix, check the maximum coming from:

• any cell in the preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties;

• any cell in the preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties;

• or cell[i-1, j-1]: add score for cell[i, j].

Select the highest scoring cell anywhere in the matrix and do the trace-back until a zero-valued cell or the start of the sequence(s).

Penalty = Popen + gap_length*Pextension

$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)P_x \} \\ 0 \end{cases}$$

Here s_{i,j} is the exchange-matrix score for cell [i, j], P_i = Popen and P_x = Pextension.


Example: general algorithm for local alignment

DP algorithm with affine gap penalties (PAM250, Po=10, Pe=2)

Extra start/end columns/rows are not necessary (no end-gaps). Each negative scoring cell is set to zero. The highest scoring cell may be found anywhere in the search matrix after calculating it. Trace the highest scoring cell back to the first cell with zero value (or the beginning of one or both sequences).

PAM250:

       D    W    V    T    A    L    K
T      0   -5    0    3    1    1    0
D      4   -7   -2    0    0   -4    0
W     -7   17   -6   -5   -6   -2   -3
V     -2   -6    4    0    0    2   -2
L     -4   -2    2    1   -2    6   -3
K      0   -3   -2    0   -1   -3    5

Search matrix (partially filled in):

       D    W    V    T    A    L    K
T      0    0    0    3
D      4    0    0    0
W      0   21    0    0
V      0    0   25    9
L      0    0   11
K      0    0
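The filled-in cells of the search matrix can be reproduced with a short sketch of the general local recurrence. A zero border row and column are added purely for indexing convenience (a jump from the border can never beat the diagonal move, so it does not change any value); the names `local_dp`, `pam`, `COLS` and `PAM` are made up for the illustration.

```python
COLS = "DWVTALK"
# PAM250 values for the residue pairs used in this example.
PAM = {
    "T": [0, -5, 0, 3, 1, 1, 0],
    "D": [4, -7, -2, 0, 0, -4, 0],
    "W": [-7, 17, -6, -5, -6, -2, -3],
    "V": [-2, -6, 4, 0, 0, 2, -2],
    "L": [-4, -2, 2, 1, -2, 6, -3],
    "K": [0, -3, -2, 0, -1, -3, 5],
}

def pam(a, b):
    """Look up the PAM250 value for row residue a and column residue b."""
    return PAM[a][COLS.index(b)]

def local_dp(x, y, score, popen, pext):
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]
            # Gap of (i - xx - 1) residues, coming from the preceding column.
            for xx in range(0, i - 1):
                best = max(best, S[xx][j - 1] - popen - (i - xx - 1) * pext)
            # Gap of (j - yy - 1) residues, coming from the preceding row.
            for yy in range(0, j - 1):
                best = max(best, S[i - 1][yy] - popen - (j - yy - 1) * pext)
            S[i][j] = max(0, score(x[i - 1], y[j - 1]) + best)
    return S

S = local_dp("TDWVLK", "DWVTALK", pam, 10, 2)
for label, row in zip("TDWVLK", S[1:]):
    print(label, row[1:])
```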


Pitfalls of alignments

Alignment is often not a reconstruction of evolution (the common ancestor is usually extinct by the time of alignment)

Repeats: matches to the same fragment


Summary

1. Global, e.g. Needleman-Wunsch algorithm

2. Semi-global

3. Local, e.g. Smith-Waterman algorithm

4. Many local alignments, aka Waterman-Eggert algorithm

What’s the number of steps in these algorithms?

How much memory is used?

These questions deal with the algorithmic complexity (time, memory) of alignment


Semi-global alignment

• Ignore start gaps
– First row/column set to 0

• Ignore end gaps
– Read the result from last row/column

[Figure: semi-global DP matrix for two short DNA sequences, with the first row and column set to 0]
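Both modifications fit in a few lines. The sketch below uses a linear gap penalty; the scoring function (1 for a match, -1 for a mismatch), g = 2 and the test sequences are assumptions for the illustration.

```python
def identity(a: str, b: str) -> int:
    """Hypothetical scoring function: +1 match, -1 mismatch."""
    return 1 if a == b else -1

def semi_global_score(x, y, score=identity, g=2):
    n, m = len(x), len(y)
    # Ignore start gaps: the first row and column stay 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                M[i][j - 1] - g,
                M[i - 1][j] - g,
            )
    # Ignore end gaps: read the best value from the last row or last column.
    return max(max(M[n]), max(row[m] for row in M))

print(semi_global_score("CCCGATTACATTT", "GATTACA"))  # short sequence inside a long one
print(semi_global_score("AAAGGG", "GGGTTT"))          # overlapping ends
```

Unlike local alignment, internal cells may still go negative here; only the gaps at the start and end are free.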


Low complexity regions

• Local composition bias
– Replication slippage: e.g. triplet repeats

• Results in spurious hits
– Breaks down statistical models
– Different proteins reported as hits due to similar composition
– Up to ¼ of the sequence can be biased

• Widely used filtering program: SEG (Wootton and Federhen, 1996)

Huntington’s disease

– Huntingtin gene of unknown (!) function

– Repeats #: 6-35: normal; 36-120: disease
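The idea of filtering on local composition can be illustrated with a sliding-window Shannon entropy. This is not the SEG program and does not reproduce its method or parameters; the window length, the threshold and the function names are arbitrary assumptions for the sketch.

```python
import math
from collections import Counter

def window_entropy(seq, w):
    """Shannon entropy (bits) of the residue composition in each window."""
    out = []
    for i in range(len(seq) - w + 1):
        counts = Counter(seq[i:i + w])
        out.append(-sum(c / w * math.log2(c / w) for c in counts.values()))
    return out

def mask_low_complexity(seq, w=12, threshold=2.2):
    """Replace residues in every low-entropy window by 'x'."""
    masked = list(seq)
    for i, h in enumerate(window_entropy(seq, w)):
        if h < threshold:
            for k in range(i, i + w):
                masked[k] = "x"
    return "".join(masked)

# A poly-glutamine stretch (as in triplet-repeat expansions) is masked,
# while the compositionally diverse flanks are kept.
print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 20 + "NMLKIHGFEDCA"))
```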


Synteny

• Synteny is preservation of gene order
– 5 to 20 million years of divergence

Sequencing and comparison of yeast species to identify genes and regulatory elements, Manolis Kellis et al., Nature 423, 241-254 (15 May 2003)

Arrows in figure represent genes; arrow direction indicates ‘+’ or ‘-’ strand of DNA

Figure shows preservation of gene order in four yeast species


Genome vs. gene

                     Genome                              Gene
Built of...          DNA                                 DNA
Describes            Organism                            Protein
Stored as...         Single molecule, or a few of them   Part of genome
Circular/linear      Both (depending on the species)     Linear
“Life cycle”         DNA-DNA-DNA-...                     DNA-RNA-protein
Size                 0.5Mb*) ... 3500Mb                  100b ... 100000b
Amount per cell      1                                   500 .. 50000
Information content  2% ... 100%                         ~30%

*) b stands for bases or nucleotides


Take-home messages (I)

• Global, semi-global and local alignment

• Three types of gap penalties

• ‘easy’ (fast) and general algorithm

• Pitfalls of local alignment

• Low-complexity regions


Take-home messages (II)

Make sure you understand and can carry out

1. the ‘simple’ DP algorithm (for linear gap penalties)

2. the general DP algorithm for global, semi-global and local alignment!