Bioinformatic PhD. course Bioinformatics Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) LSI Dep. de Llenguatges i Sistemes Informàtics BSC Barcelona Supercomputing Center Universitat Politècnica de Catalunya

Page 1: Bioinformatic PhD. course

Bioinformatic PhD. course

Bioinformatics

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

LSI Dep. de Llenguatges i Sistemes Informàtics

BSC Barcelona Supercomputing Center

Universitat Politècnica de Catalunya

Page 2: Bioinformatic PhD. course

Contents

1. Biological introduction

2. Comparison of short sequences (up to 10,000 bps): Dot matrix, Pairwise alignment, Multiple alignment, Hash algorithms

3. Comparison of large sequences (more than 10,000 bps): Data structures, Suffix trees, MUMs

4. String matching: Exact, Extended, Approximate

5. Sequence assembly

6. Projects: PROMO, MREPATT, …

Page 3: Bioinformatic PhD. course

String matching

1. (Exact) String matching of one pattern

2. (Exact) String matching of many patterns

3. Extended string matching

4. Approximate string matching (Dynamic programming)

• Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002

• Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997

Page 4: Bioinformatic PhD. course

String matching

Definition: given a long text T and a set of k patterns p1,p2,…,pk, the string matching problem is to find all the occurrences of all the patterns in the text T.

On-line algorithms: the patterns are known in advance and can be preprocessed.

• Only one pattern (exact and approximate)

• Five, ten, a hundred, a thousand, ... patterns (exact)

Off-line algorithms: the text is known in advance and can be preprocessed.

• Suffix trees

Page 5: Bioinformatic PhD. course

Master Course

First part:

(Exact) string matching

Page 6: Bioinformatic PhD. course

String matching: one pattern

For instance, given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

search for the pattern ACTGA

and for the pattern TACTACGGTATGACTAA.

How do string matching algorithms perform the search?

Page 7: Bioinformatic PhD. course

String Matching: Brute force algorithm

Example: given the pattern ATGTA, the search is

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A

Page 8: Bioinformatic PhD. course

String Matching: Brute force algorithm

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open Brute Force algorithm
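The brute force search can be sketched in a few lines of Python. This is only a minimal illustration: the function name `brute_force_search` is an assumption made here, not part of the course material. The window moves one cell at a time and the comparison runs from left to right.

```python
def brute_force_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):          # the window is shifted one cell at a time
        i = 0
        while i < m and text[pos + i] == pattern[i]:
            i += 1                        # left-to-right comparison (prefix search)
        if i == m:                        # all m characters matched
            hits.append(pos)
    return hits


if __name__ == "__main__":
    print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

Each of the roughly n window positions may need up to m comparisons, which is where the O(nm) worst-case cost comes from.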

Page 9: Bioinformatic PhD. course

String Matching of one pattern

The cost of the Brute Force algorithm is O(nm).

Can the search be made at a lower cost? And what is the expected number of comparisons?

Text: CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

Pattern: TACTACGGTATGACTAA

• Prefix search

• Suffix search

• Factor search

Page 10: Bioinformatic PhD. course

String matching of one pattern

How do string matching algorithms perform the search?

There is a sliding window along the text against which the pattern is compared:

[Figure: the pattern aligned under a window of the text.]

At each step the comparison is made and the window is shifted to the right.

What differentiates the algorithms?

1. How the comparison is made.

2. The length of the shift.

Page 11: Bioinformatic PhD. course

String Matching: Brute force algorithm

[Figure: the pattern aligned under a window of the text.]

• How is the comparison made?

From left to right: prefix search.

• What is the next position of the window?

The window is shifted by only one cell.

The cost is O(mn).

Page 12: Bioinformatic PhD. course

String Matching: one pattern

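Most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching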

Page 13: Bioinformatic PhD. course

String Matching: Horspool algorithm

[Figure: the pattern aligned under a window of the text whose last letter is a, and the positions where a appears in the pattern.]

• How is the comparison made?

From right to left: suffix search.

• What is the next position of the window?

It depends on where the last letter of the text window, say 'a', appears in the pattern.

A preprocessing step is therefore needed to determine the length of the shift.

Page 15: Bioinformatic PhD. course

String Matching: Horspool algorithm

Example: given the pattern ATGTA, the shift table is

A 4
C 5
G 2
T 1

And the search:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A

…http://www-igm.univ-mlv.fr/~lecroq/string/index.html
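The preprocessing and the search of the Horspool example can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `horspool_shift_table` and `horspool_search` and the default alphabet "ACGT" are chosen here and are not taken from the slides. The function also records the window positions so they can be compared with the example.

```python
def horspool_shift_table(pattern, alphabet="ACGT"):
    """Shift for each letter: distance from its last occurrence in
    pattern[:-1] to the end of the pattern, or m if it does not occur."""
    m = len(pattern)
    table = {c: m for c in alphabet}
    for i, c in enumerate(pattern[:-1]):
        table[c] = m - 1 - i
    return table


def horspool_search(text, pattern, alphabet="ACGT"):
    """Return (occurrence positions, visited window positions)."""
    m = len(pattern)
    table = horspool_shift_table(pattern, alphabet)
    hits, windows = [], []
    pos = 0
    while pos <= len(text) - m:
        windows.append(pos)
        i = m - 1
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1                        # right-to-left comparison (suffix search)
        if i < 0:
            hits.append(pos)
        # the shift depends only on the last letter of the window
        pos += table.get(text[pos + m - 1], m)
    return hits, windows


if __name__ == "__main__":
    print(horspool_shift_table("ATGTA"))
    print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

On the 22 letters of text shown in the example, the sketch visits the windows starting at positions 0, 1, 5, 7, 12 and 14 and stops there because the text is cut off.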

Page 16: Bioinformatic PhD. course

String Matching: one pattern

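The most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

What happens with many patterns?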

Page 17: Bioinformatic PhD. course

String matching: many patterns

Given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

Search for the patterns

ACTGACTGTCTAATT

ACTGATCTTTGTAGCAATACTACATGCACTGA.

Page 18: Bioinformatic PhD. course

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns

[Figure: trie of the inverted patterns.]

2. lmin = 4

3. Table of shifts

A 1
C 4 (lmin)
G 2
T 1

4. Start the search

Page 19: Bioinformatic PhD. course

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

[Figure: trie of the inverted patterns.]

The text: ACATGCTATGTGACA…

Table of shifts:

A 1
C 4 (lmin)
G 2
T 1

Page 25: Bioinformatic PhD. course

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

[Figure: trie of the inverted patterns.]

The text: ACATGCTATGTGACA…

Table of shifts:

A 1
C 4 (lmin)
G 2
T 1

Short shifts!

Page 26: Bioinformatic PhD. course

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol:

A 1
C 4 (lmin)
G 2
T 1

2 symbols:

AA 1
AT 1
GT 1
TA 2
TG 2

so that the complete table of 2-mers is

AA 1
AC 3 (lmin-l+1)
AG 3
AT 1
CA 3
CC 3
CG 3
…
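Both tables above can be derived with the same procedure. The following Python sketch is an assumption-laden illustration: the name `lmer_shift_table` is invented here, and the code follows the tables of the slides, which are computed on the last lmin characters of each pattern. It covers only the shift table, not the full Wu-Manber search, and published descriptions of the algorithm may differ in detail.

```python
def lmer_shift_table(patterns, l):
    """Return (table, default): table maps each l-mer that occurs in the last
    lmin characters of some pattern (excluding the final block) to its shift;
    every other l-mer gets the default shift lmin - l + 1."""
    lmin = min(len(p) for p in patterns)
    default = lmin - l + 1
    table = {}
    for p in patterns:
        suffix = p[-lmin:]                    # only the last lmin characters count
        for end in range(l, lmin):            # block ends before the last position
            block = suffix[end - l:end]
            table[block] = min(table.get(block, default), lmin - end)
    return table, default


if __name__ == "__main__":
    patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(lmer_shift_table(patterns, 1))      # one symbol: set Horspool table
    print(lmer_shift_table(patterns, 2))      # two symbols
```

With l = 1 the default shift is lmin, as in the one-symbol table; with l = 2 it is lmin - l + 1 = 3, as in the entries AC, AG, CA, CC and CG.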

Page 27: Bioinformatic PhD. course

Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG

[Figure: trie of the inverted patterns.]

in the text: ACATGCTATGTGACATAATA

Table of shifts:

AA 1
AT 1
GT 1
TA 2
TG 2

Experimental length of the l-mers: log_|Σ| (2 * lmin * r), where r is the number of patterns

Page 28: Bioinformatic PhD. course

String matching of many patterns

[Figure: three maps, for 5, 10 and 100 patterns, of the most efficient algorithm (Wu-Manber or SBOM) as a function of Lmin (horizontal axis, 5 to 45) and of |Σ| (vertical axis, 2 to 8).]

Page 29: Bioinformatic PhD. course

String Matching: one pattern

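The most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching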

Page 30: Bioinformatic PhD. course

BNDM algorithm

[Figure: the pattern aligned under a window of the text; x is the next character to be read.]

• How is the comparison made?

BNDM searches for suffixes of the text window T that are factors of the pattern P.

This state is expressed with an array D of bits:

D2 = 1 0 0 0 1 0 0

How can the next state be obtained?

Given the mask B(x) of x, which marks the cells where character x appears in the pattern:

D = (D << 1) & B(x)

If B(x) = ( 0 0 1 1 0 0 0 ) then

D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )

• How is the shift determined?

Page 31: Bioinformatic PhD. course

BNDM algorithm: example

Given the pattern ATGTA,

the mask of characters is:

B(A) = ( 1 0 0 0 1 )
B(C) =
B(G) =
B(T) =

Page 32: Bioinformatic PhD. course

BNDM algorithm: example

Given the pattern ATGTA,

the mask of characters is:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

Page 33: Bioinformatic PhD. course

BNDM algorithm: example

Given the pattern ATGTA, the mask of characters is:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

Given the text:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A

First window:

D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Second window:

D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )

Third window:

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Page 34: Bioinformatic PhD. course

BNDM algorithm: example

The pattern is ATGTA, the masks are:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

and the text:

G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )

Pattern found!

Page 36: Bioinformatic PhD. course

BNDM algorithm

[Figure: the pattern aligned under a window of the text.]

• How is the comparison made?

BNDM searches for suffixes of the text window T that are factors of the pattern P.

This state is expressed with an array D of bits:

D = 1 0 0 0 1 0 0

• How is the shift determined?

If the leftmost bit is set to one in step i, a prefix of P of length i is equal to a suffix of the window. Taking the largest such i smaller than m, the window is shifted m-i cells; if the leftmost bit is never set, it is shifted m cells.
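The bit-parallel update and the shift rule can be put together in a short Python sketch. It is only an illustration under assumptions: the name `bndm_search` is invented here, Python integers stand in for the machine word, and the leftmost cell of the slides is stored in the highest of the m bits.

```python
def bndm_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # B[c]: the leftmost cell of the slides is the highest of the m bits
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1
    high = 1 << (m - 1)
    hits = []
    pos = 0
    while pos <= len(text) - m:
        j, last, D = m, m, full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window from right to left
            j -= 1
            if D & high:                       # a prefix of the pattern was recognised
                if j > 0:
                    last = j                   # remember the shift m - i
                else:
                    hits.append(pos)           # the whole window matched
            D = (D << 1) & full                # D << 1, keeping only m bits
        pos += last
    return hits


if __name__ == "__main__":
    print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

On the text of the example the windows start at positions 0, 5, 10 and 14, and the values taken by D are the ones listed on the example slides.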

Page 37: Bioinformatic PhD. course

String matching: one pattern

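The most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching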

Page 38: Bioinformatic PhD. course

BOM (Backward Oracle Matching)

[Figure: the pattern aligned under a window of the text.]

• How is the comparison made?

With an automaton, the Factor Oracle (1999), which checks whether the suffix is a factor of the pattern.

• How is the shift determined?

Page 39: Bioinformatic PhD. course

Automaton Factor Oracle: properties

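Factor Oracle of the word G T A T G T A

[Figure: the Factor Oracle automaton of G T A T G T A; the legend distinguishes suffixes found before from suffixes that have not been found before.]

The automaton recognizes the factors of the word, for instance G T A T G, A T G and T A T G,

but the automaton also recognizes other strings, such as G T G.

It is therefore useful only for discarding words as factors!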
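The behaviour described above can be checked with a small Python sketch of the usual on-line construction of the Factor Oracle. The function names `build_factor_oracle` and `accepts`, and the representation of the automaton as a list of dictionaries, are assumptions made here for illustration only.

```python
def build_factor_oracle(word):
    """Return the transitions of the Factor Oracle as a list of dicts,
    one per state 0..len(word); every state is accepting."""
    trans = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)           # supply links used by the construction
    for i, c in enumerate(word):
        trans[i][c] = i + 1                   # internal transition
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1               # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans, s):
    """True if the oracle can read s starting from the initial state."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


if __name__ == "__main__":
    oracle = build_factor_oracle("GTATGTA")
    for w in ("GTATG", "ATG", "TATG", "GTG", "GG"):
        print(w, accepts(oracle, w))
```

A word rejected by the oracle is certainly not a factor, while an accepted word such as GTG may still fail to be one, which is why the oracle is used only to discard words.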

Page 45: Bioinformatic PhD. course

BOM: example

• The Factor Oracle of the inverted pattern is built. Given the pattern ATGTATG:

[Figure: the Factor Oracle automaton of the inverted pattern G T A T G T A.]

• Search:

G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
…

• How is the comparison made?

Page 46: Bioinformatic PhD. course

BOM (Backward Oracle Matching)

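[Figure: the pattern aligned under a window of the text; a is the character where the backward reading fails.]

• How is the comparison made?

With the Factor Oracle automaton, which checks whether the suffix is a factor of the pattern.

• How is the shift determined?

If a is the first mismatch, the window is shifted just past a.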
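A compact Python sketch of the whole BOM search is given below, under stated assumptions: the names `build_oracle` and `bom_search` are invented here, the window is moved by one cell after an occurrence, and the visited window positions are returned only so they can be compared with the example.

```python
def build_oracle(word):
    """Factor Oracle of word as a list of transition dicts (one per state)."""
    trans = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word):
        trans[i][c] = i + 1
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text, pattern):
    """Return (occurrence positions, visited window positions)."""
    m = len(pattern)
    oracle = build_oracle(pattern[::-1])      # oracle of the inverted pattern
    hits, windows = [], []
    pos = 0
    while pos <= len(text) - m:
        windows.append(pos)
        state, j = 0, m
        while j > 0 and state is not None:
            state = oracle[state].get(text[pos + j - 1])   # read right to left
            j -= 1
        if state is not None:
            # the only word of length m accepted by the oracle is the word itself
            hits.append(pos)
            pos += 1
        else:
            pos += j + 1                      # shift just past the mismatching character
    return hits, windows


if __name__ == "__main__":
    print(bom_search("GTACTAGAATGTGTAGACATGTATGGTGA", "ATGTATG"))
```

On the text of the example the first windows start at positions 0, 6, 8, 11 and 18, and the occurrence is found at position 18.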

But what happens with many patterns?

Page 47: Bioinformatic PhD. course

SBOM

[Figure: the patterns aligned under a window of the text.]

• How is the comparison made?

With the Factor Oracle automaton, which checks whether the suffix is a factor of any pattern.

• How is the shift determined?

Page 48: Bioinformatic PhD. course

Factor Oracle of many patterns

The AFO (Automaton Factor Oracle) of GTATGTA, GTAA, TAATA and GTGTA

[Figure: the Factor Oracle automaton of the four words, with states labelled 1,4, 3 and 2.]

Page 49: Bioinformatic PhD. course

SBOM algorithm

[Figure: the patterns aligned under a window of the text; a is the character being read.]

Automaton … of length lmin

• How is the comparison made?

• How is the shift determined?

• If the character a does not appear in the AFO

• If lmin characters have been read

Page 50: Bioinformatic PhD. course

SBOM algorithm : example

Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG

[Figure: the AFO of the patterns, with states labelled 1, 4, 2 and 3.]

in the text ACATGCTAGCTATAATAATGTATG

Page 57: Bioinformatic PhD. course

Exact search algorithms for many patterns

[Figure: four maps, for 5, 10, 100 and 1000 words, of the most efficient algorithm (Wu-Manber, SBOM or Ad AC) as a function of the minimum length (horizontal axis, 5 to 45) and of |Σ| (vertical axis, 2 to 8).]

Page 58: Bioinformatic PhD. course

PhD. Course

Second part:

Extended string matching

Page 59: Bioinformatic PhD. course

Extended string matching

1. Classes of characters in the text.

There are characters in the text that represent sets of symbols.

2. Classes of characters in the pattern.

There are characters in the pattern that represent sets of symbols.

There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:

R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T}

B = {G,T,C} D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any)

Page 60: Bioinformatic PhD. course

Classes in the text

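Most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching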

Page 62: Bioinformatic PhD. course

Classes in the text :Horspool example

Given the pattern ATGTA

• the shift table is:

A 4
C 5
G 2
T 1
R ?
…
N ?

Page 63: Bioinformatic PhD. course

Classes in the text :Horspool example

Given the pattern ATGTA

• the shift table is:

A 4
C 5
G 2
T 1
R 2
…
N ?

Page 65: Bioinformatic PhD. course

Classes in the text :Horspool example

Given the pattern ATGTA

• and the shift table:

A 4
C 5
G 2
T 1
R 2
…
N 1

Given the text:

G T A R T R N A A G G A ...
A T G T A
  A T G T A
      A T G T A
              A T G T A
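The shifts of the class symbols in the table above are the minimum of the shifts of their members: R = {G,A} gets min(2, 4) = 2 and N gets min(4, 5, 2, 1) = 1. The Python sketch below illustrates this under assumptions made here: the names `class_shift_table` and `search_with_text_classes` are invented, the pattern uses only A, C, G and T, and a text symbol is taken to match a pattern letter when the letter belongs to the class of the symbol.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def class_shift_table(pattern):
    """Horspool shifts extended to IUPAC symbols: the shift of a class is
    the smallest shift among the letters it represents."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    for i, c in enumerate(pattern[:-1]):
        base[c] = m - 1 - i
    return {sym: min(base[c] for c in members) for sym, members in IUPAC.items()}


def search_with_text_classes(text, pattern):
    """Return (occurrence positions, visited window positions)."""
    m = len(pattern)
    shifts = class_shift_table(pattern)
    hits, windows = [], []
    pos = 0
    while pos <= len(text) - m:
        windows.append(pos)
        # right-to-left comparison; a text symbol matches any letter of its class
        if all(pattern[i] in IUPAC[text[pos + i]] for i in reversed(range(m))):
            hits.append(pos)
        pos += shifts[text[pos + m - 1]]
    return hits, windows


if __name__ == "__main__":
    print(class_shift_table("ATGTA"))
    print(search_with_text_classes("GTARTRNAAGGA", "ATGTA"))
```

Taking the minimum keeps the search safe, since the class symbol might stand for the letter with the shortest shift, but it also makes the shifts shorter.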

Page 66: Bioinformatic PhD. course

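Exact search algorithms for one pattern (on-line text)

Most efficient algorithms (Navarro & Raffinot)

[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the length of the pattern (horizontal axis, 2 to 256) and of |Σ| (vertical axis, 2 to 64); the label w also appears in the figure.]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching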

Page 67: Bioinformatic PhD. course

Classes in the text: BOM

[Figure: the pattern aligned under a window of the text.]

• How is the next position of the window determined?

• How is the comparison made?

With an automaton, the Factor Oracle, which checks whether the suffix is a factor of the pattern.

But first let us analyse how the comparison is made…

Page 68: Bioinformatic PhD. course

Classes in the text: BOM example

• The automaton of the inverted pattern is built. Suppose that the pattern is ATGTATG

[Figure: the Factor Oracle automaton of the inverted pattern G T A T G T A.]

• And the search on the text:

G T A R T R N A A T G …
A T G T A T G

• How is the comparison made?

No improvement is possible!

Page 69: Bioinformatic PhD. course

Exact search algorithms for many patterns

[Figure: four maps, for 5, 10, 100 and 1000 words, of the most efficient algorithm (Wu-Manber, SBOM or Ad AC) as a function of the minimum length (horizontal axis, 5 to 45) and of |Σ| (vertical axis, 2 to 8).]

Page 71: Bioinformatic PhD. course

Classes in the text: Set Horspool

Search for the patterns ATGTATG, TATG, ATAAT, ATGTG

[Figure: trie of the inverted patterns.]

In the text: ARTGNCTATGTGACA…

No improvement is possible!

Page 72: Bioinformatic PhD. course

Master Course

Third part:

Regular expressions matching

Page 73: Bioinformatic PhD. course

Regular expressions

A regular expression ℛ is a string over Σ ∪ { ε, |, · , * , (, ) } defined recursively as:

• ε is a regular expression

• A character of Σ is a regular expression

• ( ℛ ) is a regular expression

• ℛ1 · ℛ2 is a regular expression

• ℛ * is a regular expression

• ℛ1 | ℛ2 is a regular expression

Page 74: Bioinformatic PhD. course

Regular language

The language represented by a regular expression is the set of words that can be built from the regular expression.

The problem of searching for a regular expression in a text is that of finding all the factors that belong to the corresponding regular language.

Page 75: Bioinformatic PhD. course

Searching for a regular expression

[Diagram with the stages: regular expression, parse tree, NFA, DFA, search for the occurrences; and two ways of searching: search with a deterministic automaton, search with the bit-parallel Thompson.]

Page 76: Bioinformatic PhD. course

PhD. Course

Fourth part:

Approximate string matching

Page 77: Bioinformatic PhD. course

Approximate string matching

For instance, given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

search for the pattern ACTGA allowing one error…

… but what is the meaning of “one error”?

Page 78: Bioinformatic PhD. course

Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT

2. Insertion: ACCGTGAT → ACCGATGAT

3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)= d(ACT,AC)= d(ACT,C)= d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

Page 80: Bioinformatic PhD. course

Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT

2. Insertion: ACCGTGAT → ACCGATGAT

3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)=0 d(ACT,AC)=1 d(ACT,C)=2 d(ACT,)=3 d(AC,ATC)=1 d(ACTTG,ATCTG)=2

Page 81: Bioinformatic PhD. course

Edit distance

The edit distance is related to the best alignment of the strings.

Given d(ACT,ACT)=0, d(ACT,AC)=1 and d(ACTTG,ATCTG)=2, which is the best alignment in every case?

• ACT and ACT:

ACT
ACT

• ACT and AC:

ACT
AC-

• ACTTG and ATCTG:

ACTTG
ATCTG

or

ACT-TG
A-TCTG

Page 82: Bioinformatic PhD. course

Edit distance

But what is the distance between the strings

ACGCTATGCTATACG and ACGGTAGTGACGC?

… and the best alignment between them?

This problem was first discussed in 1966…

and the algorithm was proposed in 1968, 1970, …

using the technique called “Dynamic programming”

Page 83: Bioinformatic PhD. course

Edit distance

[Matrix: the columns are labelled with the text C T A C T A C T A C G T and the rows with the pattern A C T G A.]

The marked cell contains the distance between AC and CTACT.

Page 84: Bioinformatic PhD. course

Edit distance and alignment of strings

The first row and the first column of the matrix are filled first:

      C T A C T A C T A C G T
    0 1 2 3 4 5 6 7 8 …
  A 1
  C 2
  T 3
  G …
  A

In the first row, each cell aligns a prefix of the text with gaps only:

• distance 1: - aligned with C

• distance 2: - - aligned with CT

• distance 6: - - - - - - aligned with CTACTA

In the first column, each cell aligns a prefix of the pattern with gaps only:

• distance 3: ACT aligned with - - -

Page 91: Bioinformatic PhD. course

Edit distance and alignment of strings

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the global method.
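The dynamic programming computation of the edit distance can be sketched as follows. This is a minimal Python illustration, with the function name `edit_distance` chosen here; it fills the first row and the first column as in the slides and then takes, for each cell, the cheapest of a substitution (or match), a deletion and an insertion.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to transform string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                       # first column: i deletions
    for j in range(cols):
        d[0][j] = j                       # first row: j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,   # match or mismatch
                          d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1)          # insertion
    return d[-1][-1]


if __name__ == "__main__":
    for a, b in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
                 ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
        print(a, b, edit_distance(a, b))
```

The cost is proportional to the product of the two lengths, since every cell of the matrix is computed once.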

Page 92: Bioinformatic PhD. course

K-approximate string searching

How can this algorithm be applied to the approximate search, that is, to the K-approximate string searching?

Page 93: Bioinformatic PhD. course

K-approximate string searching

[Matrix: the columns are labelled with the text C T A C T A C T A C G T A C T G G T G A A … and the rows with the pattern A C T G A.]

A cell of the last row gives the distance between the pattern and a whole prefix of the text (ACTGA, CT…GTA)…

…but we are only interested in the last characters of that prefix…

…no matter where they appear in the text; then the first row is filled with zeros:

      C T A C T A C T A C G T A C T G G T G A A …
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  A
  C
  T
  G
  A

Page 100: Bioinformatic PhD. course

K-approximate string searching

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the semi-global method.
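The change from the global to the semi-global computation is small, as the Python sketch below suggests. It is an illustration under assumptions: the name `approximate_search` is chosen here, the matrix is computed column by column to keep only two columns in memory, and an occurrence is reported by the position where it ends (counted from 1) together with its distance.

```python
def approximate_search(text, pattern, k):
    """Return (end position, distance) for every text position where some
    substring ending there is at edit distance at most k from pattern."""
    m = len(pattern)
    prev = list(range(m + 1))             # column of the empty text prefix
    hits = []
    for j, c in enumerate(text, start=1):
        cur = [0]                         # first row is zero: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i - 1] + cost,   # match or mismatch
                           prev[i] + 1,          # text character left unmatched
                           cur[i - 1] + 1))      # pattern character left unmatched
        if cur[m] <= k:                   # last row: whole pattern aligned
            hits.append((j, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    text = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
    print(approximate_search(text, "ACTGA", 0))
    print(approximate_search(text, "ACTGA", 1))
```

With k = 0 the sketch reports only the exact occurrence of ACTGA, which ends at position 21 of the text; with k = 1 the same position is reported together with the positions where an occurrence with one error ends.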