bioinformatic phd. course

Bioinformatic PhD. course Bioinformatics Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) LSI Dep. de Llenguatges i Sistemes Informàtics BSC Barcelona Supercomputing Center Universitat Politècnica de Catalunya


Post on 20-Jan-2016


Page 1: Bioinformatic PhD. course

Bioinformatic PhD. course

Bioinformatics

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

LSI Dep. de Llenguatges i Sistemes Informàtics

BSC Barcelona Supercomputing Center

Universitat Politècnica de Catalunya

Page 2: Bioinformatic PhD. course

Contents

1. Biological introduction

2. Comparison of short sequences (up to 10,000 bps)

Dot matrix, Pairwise alignment, Multiple alignment, Hash algorithms

3. Comparison of large sequences (more than 10,000 bps)

Data structures, Suffix trees, MUMs

4. String matching

Exact, Extended, Approximate

5. Sequence assembly

6. Projects: PROMO, MREPATT, …

Page 3: Bioinformatic PhD. course

2. Comparison of short sequences (<10,000 bps)

Summary (more or less)

• 2.1 Dot matrix

• 2.2 Pairwise alignment

• 2.3 Hash algorithms

• 2.4 Multiple alignment

Page 4: Bioinformatic PhD. course

2.2 Pairwise alignment

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) over the alphabet {a,c,t,g},

we say that A* and B*, over the alphabet {a,c,t,g,-}, are aligned iff

i) A* and B* become A and B if the gaps (-) are removed.

ii) |A*| = |B*|

iii) There is no position i at which both A* and B* have a gap.
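
The three conditions can be checked mechanically. The following is a minimal sketch; the function name `is_alignment` and the example strings are illustrative assumptions, not part of the slides.

```python
def is_alignment(a_star: str, b_star: str, a: str, b: str, gap: str = "-") -> bool:
    """Check conditions (i)-(iii) of the definition of an alignment."""
    # (i) removing the gaps must give back the original sequences
    if a_star.replace(gap, "") != a or b_star.replace(gap, "") != b:
        return False
    # (ii) both gapped sequences must have the same length
    if len(a_star) != len(b_star):
        return False
    # (iii) no column may consist of two gaps
    return all(not (x == gap and y == gap) for x, y in zip(a_star, b_star))


print(is_alignment("ac-tg", "acgt-", "actg", "acgt"))    # True
print(is_alignment("ac--tg", "ac-gt-", "actg", "acgt"))  # False: third column is (-,-)
```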

Which is the best alignment?

How many alignments of two sequences exist?

MALIG (an example)

Page 5: Bioinformatic PhD. course

2.2 Number of alignments

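Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are:

$$
\begin{aligned}
\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_m) ={} & \#(a_1 a_2 \dots a_{n-1},\ b_1 b_2 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{} & \#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{} & \#(a_1 a_2 \dots a_{n-1},\ b_1 b_2 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
$$

        b1         b2   b3

a1      #(a1,b1)
a2
a3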

Page 9: Bioinformatic PhD. course

2.2 Number of alignments

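Given two DNA sequences A (a1a2...an) and B (b1b2...bm) then:

$$
\begin{aligned}
\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_m) ={} & \#(a_1 a_2 \dots a_{n-1},\ b_1 b_2 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{} & \#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{} & \#(a_1 a_2 \dots a_{n-1},\ b_1 b_2 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
$$

          b1   b2   b3

      1    1    1    1
a1    1    3    5    7
a2    1    5   13   25
a3    1    7   25   63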
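
The table can be reproduced by applying the recurrence directly. This is a minimal sketch; the function name `count_alignments` is an illustrative assumption, and the border cells are set to 1 as in the table.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of two sequences of lengths n and m."""
    # count[i][j] = number of alignments of a1..ai with b1..bj;
    # the border cells (an empty prefix on one side) are all 1
    count = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            count[i][j] = (
                count[i - 1][j]        # alignments ending with (ai, -)
                + count[i][j - 1]      # alignments ending with (-, bj)
                + count[i - 1][j - 1]  # alignments ending with (ai, bj)
            )
    return count[n][m]


for i in range(1, 4):
    print([count_alignments(i, j) for j in range(1, 4)])
# [3, 5, 7]
# [5, 13, 25]
# [7, 25, 63]
```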

But what is the asymptotic value?

Page 10: Bioinformatic PhD. course

2.2 Asymptotic value

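As

$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) > \sum_{k=0}^{n} \binom{n}{k} \binom{n}{k} = \binom{2n}{n}$$

and

$$n! \sim n^n e^{-n} \quad \text{(Stirling approximation)}$$

then

$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) \gtrsim 2^{2n}$$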
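
A quick numerical check makes the bound concrete. The sketch below (the helper name `count_alignments` is an assumption) prints the exact count next to the central binomial coefficient and 2^(2n). The simplified Stirling formula drops polynomial factors, so the binomial itself stays slightly below 2^(2n), while the exact count overtakes 2^(2n) from n = 4 onwards.

```python
from math import comb


def count_alignments(n: int) -> int:
    """Number of alignments of two sequences of length n (recurrence from the slides)."""
    row = [1] * (n + 1)  # row for the empty prefix: all ones
    for _ in range(n):
        new = [1]
        for j in range(1, n + 1):
            # (ai, -) + (-, bj) + (ai, bj)
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


print("n", "alignments", "C(2n,n)", "2^(2n)")
for n in (1, 2, 3, 4, 5, 10, 20):
    print(n, count_alignments(n), comb(2 * n, n), 4 ** n)
```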

Page 11: Bioinformatic PhD. course

2.2 Best alignment

How can an alignment be scored?

catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg

c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----

* * *** * * ** * ******* * * **** **** ******* * **** ** * ***

How can the best alignment be found?

• Match: favorable

• Mismatch: unfavorable

• Gap: worst case

Then we assign a score to each case, for example 1, -1 and -2, respectively.
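
With these three values, scoring a given alignment is a sum over its columns. This is a minimal sketch; the function name `score_alignment` and the short example alignment are illustrative assumptions.

```python
def score_alignment(a_star: str, b_star: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of an alignment (match 1, mismatch -1, gap -2 by default)."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            total += gap       # one symbol aligned with a gap
        elif x == y:
            total += match     # identical symbols
        else:
            total += mismatch  # different symbols
    return total


# columns: a/a (+1), c/c (+1), -/g (-2), t/t (+1), g/a (-1)
print(score_alignment("ac-tg", "acgta"))  # 0
```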

Page 12: Bioinformatic PhD. course

2.2 Edit distance and alignment of strings

The best alignment of two strings is related to the edit distance, first discussed in 1966.

The most efficient algorithm was proposed in 1968 and in 1970, using the technique called “dynamic programming”.
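
As an illustration of the technique, the edit distance itself can be computed with a small dynamic-programming table. The sketch below assumes unit costs for insertion, deletion and substitution (the slides do not fix a cost model), and the name `edit_distance` is hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```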

Page 13: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T

A
C
T
G
A

Page 15: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T

A
C                        *
T
G
A

The cell marked * contains the score of the best alignment of AC and CTACT.

Page 16: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0   ?
A
C
T
G
A

Page 17: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2   ?
A
C
T
G
A

-
C

Page 18: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4   ?
A
C
T
G
A

--
CT

Page 19: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4  -6  -8   …
A
C
T
G
A

------
CTACTA

Page 20: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4  -6  -8   …
A    ?
C    ?
T    ?
G
A

Page 21: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4  -6  -8   …
A   -2
C   -4
T   -6
G    …
A

ACT
---

Page 22: Bioinformatic PhD. course

2.2 Best alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4  -6  -8   …
A   -2
C   -4
T   -6
G
A

BA(AC,CTAC) = the best of

BA(AC,CTA) followed by the column (-, C)

BA(A,CTA) followed by the column (C, C)

BA(A,CTAC) followed by the column (C, -)

$$s(AC, CTAC) = \max \{\, s(AC, CTA) - 2,\ \ s(A, CTA) + 1,\ \ s(A, CTAC) - 2 \,\}$$
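
The recurrence fills the whole matrix once the first row and column are initialised with gap penalties. This is a minimal sketch using the example scores 1, -1 and -2 from the slides; the function name `global_score_matrix` is an illustrative assumption.

```python
def global_score_matrix(a: str, b: str,
                        match: int = 1, mismatch: int = -1, gap: int = -2):
    """s[i][j] = score of the best global alignment of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        s[0][j] = j * gap  # empty prefix of a against b[:j]: only gaps
    for i in range(1, n + 1):
        s[i][0] = i * gap  # a[:i] against the empty prefix of b: only gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diag,               # column (a_i, b_j)
                          s[i - 1][j] + gap,  # column (a_i, -)
                          s[i][j - 1] + gap)  # column (-, b_j)
    return s


s = global_score_matrix("ACTGA", "CTACTACTACGT")
print(s[0][:5])                      # [0, -2, -4, -6, -8]
print([row[0] for row in s[:4]])     # [0, -2, -4, -6]
print(s[2][4])                       # s(AC, CTAC) = -2
```

Here `s[2][4]` equals `s[1][3] + 1`, the middle case of the maximum, because the last symbols of AC and CTAC match.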

Page 23: Bioinformatic PhD. course

2.2 Best alignment

[Figure: score matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]

Given the maximum score, how can the best alignment be found?

• Quadratic cost in space and time

• Suitable for sequences up to 10,000 bps in length
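
One common way to recover the alignment is to walk back from the last cell, at each step choosing a neighbour that explains the current score. The sketch below keeps the full matrix, which is where the quadratic space cost comes from; the name `global_align` and the example sequences are assumptions.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned a, aligned b) for the best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("ACTGA", "ACGA"))  # (2, 'ACTGA', 'AC-GA')
```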

Download alggen tool

Page 24: Bioinformatic PhD. course

2.2 Some slides revisited

We have developed the theory according to the following principles:

1) Both sequences have a similar length (global).

2) The gap model is linear: k consecutive gaps are penalized with k·(-2).

Page 25: Bioinformatic PhD. course

2.2 Semiglobal pairwise alignment

Assume that we have two sequences of different lengths, S1 and S2.

It is meaningless to introduce gaps until both sequences have a similar length.

The most probable alignment should be

[Figure: S1 and S2 aligned, with the initial gaps and the final gaps marked]

How can these alignments be found?

Page 26: Bioinformatic PhD. course

2.2 Semiglobal pairwise alignment

         C   T   A   C   T   A   C   T   A   C   G   T

A
C
T

Initial gaps

Note that

Final gaps

Page 27: Bioinformatic PhD. course

2.2 Semiglobal pairwise alignment

         C   T   A   C   T   A   C   T   A   C   G   T
     0   0   0   0   0   0   0   0   0   0   0   0   0
A
C
T

Given a cell

The cell contains the score of the best alignment of CTA with the empty sequence.

Page 28: Bioinformatic PhD. course

2.2 Semiglobal pairwise alignment

[Matrix: columns C T A C T A C T A C G T, rows A C T; first row 0 0 0 0 0 0 0 …]

The contribution of the initial gaps is disregarded, then

[Matrix: same first row of zeros; the rows A, C and T hold the scores 1, 2 and 3]

But what happens with the final gaps?

Page 29: Bioinformatic PhD. course

2.2 Semiglobal pairwise alignment

[Matrix: columns C T A C T A C T A C G T, rows A C T; first row 0 0 0 0 0 0 0 …; the rows A, C and T hold the scores 1, 2 and 3]

How does the algorithm search for the best alignment?

… by checking the last row for the best score.

Practice with the alggen tool.
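
Both ideas, a first row of zeros and a search of the last row, fit in a few lines. This is a minimal sketch under stated assumptions: the short sequence labels the rows, gaps inside the match are still penalized with -2, and the function name `semiglobal_score` is hypothetical.

```python
def semiglobal_score(short: str, long_: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Best score of `short` against any stretch of `long_`, with free end gaps.

    Returns (score, column of the last row where the best score is found).
    """
    m = len(long_)
    prev = [0] * (m + 1)  # first row of zeros: initial gaps are disregarded
    for i in range(1, len(short) + 1):
        curr = [i * gap]  # symbols of `short` left unaligned are penalized
        for j in range(1, m + 1):
            sub = match if short[i - 1] == long_[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    best = max(prev)  # final gaps are disregarded: take the best cell of the last row
    return best, prev.index(best)


print(semiglobal_score("ACT", "CTACTACTACGT"))  # (3, 5)
```

The result says that ACT aligns perfectly with the stretch of the long sequence ending at column 5, which matches the scores 1, 2, 3 shown for the rows A, C and T.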

Page 30: Bioinformatic PhD. course

2.2 Affine-gap model score

Given the following alignments that have the same score …

a g t a c c c c g t a g
a g t - c c - - g t a -

a g t a c c c c g t a g
a g t - c - c - g t a -

a g t a c c c c g t a g
a g t - c - - c g t a -

a g t a c c c c g t a g
a g t - - c c - g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

Which is the most reliable case from a biological point of view?

Page 31: Bioinformatic PhD. course

2.2 Affine-gap model score

Then, how can we distinguish between consecutive gaps and separated gaps?

a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

By penalizing the opening of a gap more heavily than its extension, for instance with -10 and -0.5.

Then the penalty of k consecutive gaps becomes OG + (k-1) EG, which is an affine-gap function.
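
The effect of the affine penalty on the two alignments above can be computed directly. This is a minimal sketch with OG = -10 and EG = -0.5 as in the example; the function names are illustrative assumptions.

```python
import re


def gap_penalty(k: int, opening: float = -10.0, extension: float = -0.5) -> float:
    """Penalty of a run of k consecutive gaps: OG + (k - 1) * EG."""
    return opening + (k - 1) * extension if k > 0 else 0.0


def total_gap_penalty(row: str, opening: float = -10.0, extension: float = -0.5) -> float:
    """Sum the affine penalties over every run of '-' in a gapped sequence."""
    runs = [len(run) for run in re.findall(r"-+", row)]
    return sum(gap_penalty(k, opening, extension) for k in runs)


print(total_gap_penalty("agt--c-cgta-"))  # -30.5: three separate runs of gaps
print(total_gap_penalty("agt---ccgta-"))  # -21.0: two runs of gaps
```

Under the linear model both rows would cost 4·(-2) = -8; under the affine model the row with fewer, longer runs is clearly preferred.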

How is the best alignment found?

Page 32: Bioinformatic PhD. course

2.2 Affine-gap model score

[Figure: score matrix with columns C T A C T A C T A C G T and rows A C T G A, with arrows between cells]

Smallest arrows: refer to the introduction of an opening gap.

Largest arrows: refer to the introduction of an extension gap.

But from which cell do the largest arrows originate?

Page 33: Bioinformatic PhD. course

2.2 Affine-gap model score

[Figure: score matrix with columns C T A C T A C T A C G T and rows A C T G A, with arrows between cells]

In both cases we know which cell contributes with the minimum penalty score.
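
One standard way to keep track of which cell a gap extends is to store three scores per cell, depending on how the alignment ends there (a formulation commonly attributed to Gotoh; the slides do not name it). The sketch below is written under that assumption, with hypothetical names and the example values OG = -10, EG = -0.5.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, match: float = 1.0, mismatch: float = -1.0,
                        opening: float = -10.0, extension: float = -0.5) -> float:
    """Best global score when a run of k gaps costs opening + (k - 1) * extension."""
    n, m = len(a), len(b)
    # M: last column is (a_i, b_j); X: last column is (a_i, -); Y: last column is (-, b_j)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = opening + (i - 1) * extension
    for j in range(1, m + 1):
        Y[0][j] = opening + (j - 1) * extension
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # open a new gap from M or Y, or extend the gap already open in X
            X[i][j] = max(M[i - 1][j] + opening, Y[i - 1][j] + opening,
                          X[i - 1][j] + extension)
            # open a new gap from M or X, or extend the gap already open in Y
            Y[i][j] = max(M[i][j - 1] + opening, X[i][j - 1] + opening,
                          Y[i][j - 1] + extension)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("ACTGA", "ACGA"))  # -6.0: four matches and one opening gap
print(affine_global_score("AAAA", "AA"))     # -8.5: two matches and one run of two gaps
```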

Access to ClustalW: http://www.ebi.ac.uk/clustalw

Page 34: Bioinformatic PhD. course

2.2 Local alignment

Given two sequences, we can consider the alignments of all their substrings…

…how can the best of them be found?

Two questions arise:

- how can the alignments be compared?

- how can the best one be selected?

Page 35: Bioinformatic PhD. course

2.2 Local alignment

Given a path

Imagine the graph of the scores: can the best subalignments be detected?

[Figure: score matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]

…It suffices to compare the value of each cell with zero!
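
Comparing each cell with zero means that an alignment is restarted whenever its running score would become negative. This is the idea behind the method commonly known as Smith-Waterman; the sketch below assumes the linear scores 1, -1, -2 used earlier, and the name `local_alignment_score` and the example strings are hypothetical.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best local score, (i, j)) where the best local alignment ends."""
    best, best_pos = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # the extra 0 lets a new local alignment start at this cell
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            if cell > best:
                best, best_pos = cell, (i, j)
        prev = curr
    return best, best_pos


# the shared stretch ACTGA occupies positions 5..9 of both sequences
print(local_alignment_score("ttttACTGAtttt", "ggggACTGAgggg"))  # (5, (9, 9))
```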