
Post on 19-Jan-2020


Page 1: Motif - pbil.univ-lyon1.frpbil.univ-lyon1.fr/members/duret/cours/chile161001/cours/Sagot-cours.pdf · Motif searc h (v ery brie y) What is the b est w a y of represen ting a motif?

Motif inference

Dispersed repeat motifs, or motifs common to a set of strings

Motif search vs. motif inference

Search
  a known motif + a text  ⇒  positions in the text where the motif is "found"

Inference
  a set of properties + a text  ⇒  motifs satisfying the properties
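To make the "search" half concrete, here is a minimal sketch in Python. It assumes that "found" means "matches with at most a given number of substitutions"; the function name `find_motif` and the example text are hypothetical, not taken from the talk.

```python
def find_motif(text, motif, max_mismatches=0):
    """Return the 0-based start positions where `motif` occurs in `text`
    with at most `max_mismatches` substitutions (no insertions/deletions)."""
    k = len(motif)
    hits = []
    for start in range(len(text) - k + 1):
        window = text[start:start + k]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical text containing one exact and one approximate copy of TATAAT.
text = "GGTATAATGCCTATTATGG"
print(find_motif(text, "TATAAT"))      # exact occurrences only
print(find_motif(text, "TATAAT", 1))   # allowing one substitution
```

Inference is the reverse problem: the motif is unknown and only the properties it must satisfy are given.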

Motif search (very briefly)

What is the best way of representing a motif?
  pattern
  position weight matrix or profile HMM
    30-40% false negatives
    45-60% false positives
  neural networks
    the more examples, the better

Example of a position weight matrix: positions 3 to 9 of the CRP binding site

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Corresponding frequency and log-likelihood position weight matrices

Frequency matrix
A    0.35   0.043  0      0.043  0.13   0.83   0.26
C    0.17   0.087  0.043  0.043  0      0.043  0.3
G    0.13   0      0.78   0      0.83   0.043  0.17
T    0.35   0.87   0.17   0.91   0.043  0.087  0.26

Log-likelihood position weight matrix
A    0.48   -2.5   -∞     -2.5   -0.94  1.7    0.061
C    -0.52  -1.5   -2.5   -2.5   -∞     -2.5   0.28
G    -0.94  -∞     1.6    -∞     1.7    -2.5   -0.52
T    0.48   1.8    -0.52  1.9    -2.5   -1.5   0.061
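The two matrices can be recomputed from the 23 aligned CRP sites. The sketch below assumes a uniform background frequency of 1/4 per nucleotide and base-2 logarithms, which is what the values shown are consistent with; the function names are illustrative only.

```python
import math

ALPHABET = "ACGT"

# The 23 aligned sites (positions 3 to 9 of the CRP binding site).
SITES = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG", "ATGCAAA",
    "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT", "ACGTGAT", "ATGTGAG",
    "TTGTGAG", "CTGTAAC", "CTGTGAA", "TTGTGAC", "GCCTGAC", "TTGTGAT",
    "TTGTGAT", "GTGTGAA", "CTGTGAC", "ATGAGAC", "TTGTGAG",
]


def frequency_matrix(sites):
    """Relative frequency of each letter at each aligned position."""
    n, length = len(sites), len(sites[0])
    return {a: [sum(s[i] == a for s in sites) / n for i in range(length)]
            for a in ALPHABET}


def log_likelihood_matrix(freqs, background=0.25):
    """log2(f / background); a frequency of 0 gives minus infinity."""
    return {a: [math.log2(f / background) if f > 0 else float("-inf")
                for f in row]
            for a, row in freqs.items()}


def score(word, llm):
    """Score of a candidate word: sum of the per-position log-likelihoods."""
    return sum(llm[c][i] for i, c in enumerate(word))


freqs = frequency_matrix(SITES)
llm = log_likelihood_matrix(freqs)
for a in ALPHABET:
    print(a, [round(x, 2) for x in llm[a]])
```

In practice pseudocounts are often added before taking logarithms so that no entry is minus infinity; that variation is not shown here.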

Example of a profile HMM

[Figure: profile HMM with a begin state (b), an end state (e) and, for three positions, match states m1-m3, insert states i0-i3 and delete states d1-d3. The three aligned columns are CCCCC (1), AGDVK (2) and FWYFY (3).]

Different HMM or HMM-related architectures

[Figure: four state diagrams, labelled BLOCKS, META-MEME, profile HMM and HMMER2 "Plan 7".]

Motif inference: set of properties

Main property: motif of interest = "conserved" element

Various possible measures for "conservation"
  conservation at the sequence level?
  conservation at the level of physico-chemical properties of the nucleotide sequences?

In this talk:

"letter" conservation
  TATAAT
  TTGNCA
  RFXCP

physico-chemical conservation
  run of pyrimidines
  run of hydrophilic amino acids

or a mixture of both
  TA[AT]N[AT]T
  [ILMV][ASG]XXC[ILMV]H[FYW]P
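Patterns that mix letters, bracketed groups and wild cards map naturally onto regular expressions. The sketch below assumes that N is the wild card for nucleotide patterns and X the wild card for protein patterns; the helper name `find_pattern` and the two test sequences are made up for illustration.

```python
import re


def find_pattern(text, pattern, wildcard):
    """Return (start, matched_word) for every, possibly overlapping,
    occurrence of a degenerate pattern such as TA[AT]N[AT]T."""
    parts = []
    inside_group = False
    for c in pattern:
        if c == "[":
            inside_group = True
        elif c == "]":
            inside_group = False
        # The wild card only has its special meaning outside [...] groups.
        parts.append("." if (c == wildcard and not inside_group) else c)
    # A lookahead makes overlapping occurrences visible to finditer.
    regex = re.compile("(?=(" + "".join(parts) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(text)]


print(find_pattern("CCTATAATGGTAACTTCC", "TA[AT]N[AT]T", "N"))
print(find_pattern("MKLAQRCVHFPGG", "[ILMV][ASG]XXC[ILMV]H[FYW]P", "X"))
```

The wild card is passed explicitly because N is also the one-letter code of asparagine in protein sequences.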


"Statistical" conservation measure

GTTTTTCT
CTGCATCT
GTGTAACC
GGGTATGT
TTGTCTCT
GCTTATCT
ATGTCTCT
GAGTATCA
GTGTAGGT
GTGAATCA

A    1  1  0  1  7  1  0  2
C    1  1  0  1  2  0  8  1
G    7  1  8  0  0  1  2  0
T    1  7  2  8  1  8  0  7

"Most surprising" sets of words

$\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_{\alpha}}$   (relative entropy)

weighted average of the log-likelihood (the weights are the frequencies)

Letters observed at each of the 5 positions, with $f_{\alpha} = 1/4$, and the contribution of each position:
A G T C    0
G A C T    0
T G C A    0
C G A T    0
G C A T    0
total: 0 = 0 + 0 + 0 + 0 + 0

"Most surprising" sets of words

$\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_{\alpha}}$   (relative entropy)

Letters observed at each of the 5 positions, with $f_{\alpha} = 1/4$, and the contribution of each position:
A A A A    2
T T T T    2
C C C C    2
G G G G    2
C C C C    2
total: 10 = 2 + 2 + 2 + 2 + 2

"Most surprising" sets of words

$\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_{\alpha}}$   (relative entropy)

Letters observed at each of the 5 positions, with $f_{A} = 1/16$, and the contribution of each position:
A A A A    4
A A A A    4
A A A A    4
A A A A    4
A A A A    4
total: 20 = 4 + 4 + 4 + 4 + 4

"Most surprising" sets of words

$\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_{\alpha}}$   (relative entropy)

Letters observed at each of the 5 positions, with $f_{A} = 3/4$, and the contribution of each position (rounded):
A A A A    0.4
A A A A    0.4
A A A A    0.4
A A A A    0.4
A A A A    0.4
total: 2 = 0.4 + 0.4 + 0.4 + 0.4 + 0.4
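The four worked examples of the relative entropy can be checked numerically. This is a small sketch: the function name is illustrative, and the words are written horizontally (one word per string) so that each position holds the letters listed on the slides.

```python
import math
from collections import Counter


def relative_entropy(words, background):
    """Sum over positions i and letters a of f_ia * log2(f_ia / f_a),
    where f_ia is the frequency of letter a at position i of the words
    and f_a is the background frequency of a."""
    n = len(words)
    total = 0.0
    for i in range(len(words[0])):
        counts = Counter(w[i] for w in words)
        for letter, count in counts.items():
            f = count / n
            total += f * math.log2(f / background[letter])
    return total


uniform = {a: 0.25 for a in "ACGT"}

# Every position contains A, C, G and T once: nothing surprising.
print(relative_entropy(["AGTCG", "GAGGC", "TCCAA", "CTATT"], uniform))
# The same word four times, uniform background.
print(relative_entropy(["ATCGC"] * 4, uniform))
# A run of A's is more surprising when A is rare, less when A is common.
print(relative_entropy(["AAAAA"] * 4, {"A": 1 / 16}))
print(relative_entropy(["AAAAA"] * 4, {"A": 3 / 4}))
```

The last value is 5 × log2(4/3), about 2.08, which the slide rounds to 2.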

"Deterministic" conservation measure

"Model"    GTGTATCT

Number of differences from the model, and word:
2    GTTTTTCT
2    CTGCATCT
2    GTGTAACC
2    GGGTATGT
2    TTGTCTCT
2    GCTTATCT
2    ATGTCTCT
2    GAGTATCA
2    GTGTAGGT
2    GTGAATCA
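The count of differences for each word can be verified with a Hamming distance, assuming substitutions only. A minimal sketch:

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


MODEL = "GTGTATCT"
WORDS = [
    "GTTTTTCT", "CTGCATCT", "GTGTAACC", "GGGTATGT", "TTGTCTCT",
    "GCTTATCT", "ATGTCTCT", "GAGTATCA", "GTGTAGGT", "GTGAATCA",
]

distances = [hamming(MODEL, w) for w in WORDS]
print(distances)
```

Every word is within two substitutions of the model, although the model itself need not appear among the words.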

Model

A motif: a word written over the same alphabet as the text, or over a degenerate (physico-chemical) alphabet

A number of specific properties
  minimum number of occurrences the motif must have (quorum)
  for each occurrence, maximum number of differences allowed in relation to the motif (substitutions only, or substitutions and indels)

In fact, the two are not so different

Except perhaps for:

CTGTATCG
CTGATTCG
CTGAGACG
GTGCATCG
CTCGCTCG
CTGCGTCG
CTGTCTCG
CTGCTTCG
CTGTCTCG
CTGGATCG

A    0   0   0   2   3   1   0   0
C    9   0   1   3   3   0   10  0
G    1   0   9   2   2   0   0   10
T    0   10  0   3   2   9   0   0

Which may, at the current time, perhaps be better captured by:

"Model"    CTGNNTCG

Number of differences from the model, and word:
0    CTGTATCG
0    CTGATTCG
1    CTGAGACG
1    GTGCATCG
1    CTCGCTCG
0    CTGCGTCG
0    CTGTCTCG
0    CTGCTTCG
0    CTGTCTCG
0    CTGGATCG
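A model with wild cards is compared to a word by ignoring the N positions. The sketch below also adds a quorum test; the function names and the quorum values used in the example are assumptions for illustration.

```python
def differences(model, word, wildcard="N"):
    """Substitutions needed to turn `word` into an instance of `model`;
    positions where the model holds the wild card always match."""
    return sum(m != wildcard and m != w for m, w in zip(model, word))


def satisfies_quorum(model, words, max_diff, quorum):
    """True if at least `quorum` words are within `max_diff` differences."""
    support = sum(differences(model, w) <= max_diff for w in words)
    return support >= quorum


MODEL = "CTGNNTCG"
WORDS = [
    "CTGTATCG", "CTGATTCG", "CTGAGACG", "GTGCATCG", "CTCGCTCG",
    "CTGCGTCG", "CTGTCTCG", "CTGCTTCG", "CTGTCTCG", "CTGGATCG",
]

print([differences(MODEL, w) for w in WORDS])
print(satisfies_quorum(MODEL, WORDS, max_diff=1, quorum=10))
print(satisfies_quorum(MODEL, WORDS, max_diff=0, quorum=10))
```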

Approaches using a statistical conservation measure

Objective
  Find the set of words that is the "most surprising possible"
  It is an optimisation problem, which in general leads to a unique solution

Algorithm
  Only exact approach possible: test all sets of words and, for each of them, calculate the value of the formula
  Too time consuming ($O(n^{N} k)$); one must therefore use heuristics

"Heuristic"

Three main approaches
  Expectation-Maximization
    (Lawrence et al., Proteins, 7:41-51, 1990)
    MEME (Bailey et al., Machine Learn., 21:51-80)
  Gibbs sampling
    (Lawrence et al., Sci., 262:208-214, 1993)
  Greedy algorithm
    (w)consensus (Hertz et al., Bioinfo., 15:563-577, 1999)

Gibbs sampling

[Figure: one string is set aside; for all positions p in it, the value of the formula, $F_p$, is computed.]

The new position in that string is the one with max $F_p$ or, (stochastic), position p is chosen with probability $F_p / \sum_p F_p$,
and we start again (with another string) until convergence
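The loop described above can be sketched as follows. This is a simplified site sampler under several assumptions that are not from the talk: one occurrence per string, F_p taken to be the likelihood ratio of the window under a profile built from the other strings (with pseudocounts), a uniform background, and a fixed number of iterations in place of a convergence test. All names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def profile_from(words, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length words."""
    profile = []
    for i in range(len(words[0])):
        column = {a: pseudocount for a in ALPHABET}
        for w in words:
            column[w[i]] += 1
        total = sum(column.values())
        profile.append({a: column[a] / total for a in ALPHABET})
    return profile


def window_scores(seq, profile, background=0.25):
    """F_p for every start position p of a window of the profile's width."""
    k = len(profile)
    scores = []
    for p in range(len(seq) - k + 1):
        f = 1.0
        for i in range(k):
            f *= profile[i][seq[p + i]] / background
        scores.append(f)
    return scores


def gibbs_sample(seqs, k, iterations=200, stochastic=True, seed=0):
    """Return one start position per string."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    for it in range(iterations):
        held_out = it % len(seqs)          # the string that is set aside
        others = [s[p:p + k]
                  for j, (s, p) in enumerate(zip(seqs, positions))
                  if j != held_out]
        scores = window_scores(seqs[held_out], profile_from(others))
        if stochastic:
            # choose p with probability F_p / sum of all F_p
            positions[held_out] = rng.choices(range(len(scores)),
                                              weights=scores)[0]
        else:
            # deterministic variant: take the position with max F_p
            positions[held_out] = max(range(len(scores)),
                                      key=scores.__getitem__)
    return positions


seqs = ["ACGTTGACAGT", "TTGACACCGTA", "GGATTGACATC"]
print(gibbs_sample(seqs, k=6))
```

Because the choice is random, different seeds may end on different positions; the stochastic step is what lets the sampler leave poor local optima.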

Approaches using a "deterministic" conservation measure

Objective
  Given a model (alphabet for the motifs and properties such as quorum and maximum difference rate allowed), find all motifs which satisfy the properties
  It is an enumeration problem, which produces in general various (often a great number of) solutions

Algorithm
  An exhaustive approach is possible
  Time complexity depends on properties
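A brute-force version of the enumeration problem fits in a few lines. The sketch below tries every word of length k over the alphabet and keeps those that occur, with at most `max_diff` substitutions, in at least `quorum` sequences. It is only meant to show what is being enumerated; it is not the algorithm used in the talk, and the names and example sequences are hypothetical.

```python
from itertools import product


def occurs(model, seq, max_diff):
    """True if some window of `seq` is within `max_diff` substitutions."""
    k = len(model)
    return any(
        sum(a != b for a, b in zip(model, seq[i:i + k])) <= max_diff
        for i in range(len(seq) - k + 1)
    )


def infer_motifs(seqs, k, quorum, max_diff, alphabet="ACGT"):
    """All (model, support) pairs whose support reaches the quorum."""
    found = []
    for letters in product(alphabet, repeat=k):
        model = "".join(letters)
        support = sum(occurs(model, s, max_diff) for s in seqs)
        if support >= quorum:
            found.append((model, support))
    return found


seqs = ["AATTGACAGG", "CCTTGACTCC", "GGTTGTCAAA", "ACGTACGTAC"]
for model, support in infer_motifs(seqs, k=6, quorum=3, max_diff=1):
    print(model, support)
```

The cost grows as the alphabet size to the power k, which is why the time complexity depends so strongly on the properties chosen.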

How does the algorithm work?

It does not matter since the algorithm is exact!

However, this has in general to be followed by a STATISTICAL EVALUATION of the models found, to classify them according to how SURPRISING they are given the remainder of the sequences

One can make the models more complex: motif inference with differences and a non-transitive relation

Alphabet of models corresponds to groups of amino acids

[Figure: overlapping groups of the amino acids M, F, W, Y, H, K, R, D, Q, E, N, T, L, I, V, C, S, A, G, P, plus a wild card (-).]

One can make the models more complex: motif inference with differences and a non-transitive relation

Example
  models written over a physico-chemical alphabet
    [AST][ILMV]XX[FYW][HKR]X[PG]C
  occurrences
    0 difference      AIAGWHAPC
    1 substitution    ATTAYHSPC
    1 deletion        SVMLFLPC
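The three occurrences can be checked with a small dynamic program that aligns the whole word against the model and counts substitutions, insertions and deletions. This is a sketch under the assumption that X is the wild card and that every difference costs 1; the function names are illustrative.

```python
def parse_model(model):
    """Split a model such as [AST]XC into one entry per position:
    a set of allowed letters, or None for the wild card X."""
    positions, i = [], 0
    while i < len(model):
        if model[i] == "[":
            j = model.index("]", i)
            positions.append(set(model[i + 1:j]))
            i = j + 1
        else:
            positions.append(None if model[i] == "X" else {model[i]})
            i += 1
    return positions


def differences(model, word):
    """Minimum number of substitutions and indels between model and word."""
    pos = parse_model(model)
    m, n = len(pos), len(word)
    # d[i][j]: cost of aligning the first i model positions
    # with the first j letters of the word
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ok = pos[i - 1] is None or word[j - 1] in pos[i - 1]
            d[i][j] = min(d[i - 1][j - 1] + (0 if ok else 1),  # (mis)match
                          d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1)                      # insertion
    return d[m][n]


MODEL = "[AST][ILMV]XX[FYW][HKR]X[PG]C"
for word in ["AIAGWHAPC", "ATTAYHSPC", "SVMLFLPC"]:
    print(word, differences(MODEL, word))
```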

One can make the models more complex: structured models

Smile (Marsan et al., JCB, 7:345-362, 2000)

an ordered collection of p boxes, p maximum rates of differences, p − 1 intervals of distances (between successive boxes in the collection)

[Figure: occurrences of a two-box model, quorum = 3/4. Model: TTGACA (e1 = 2), d = 17 ± 1, TATAAT (e2 = 1). Occurrences: TTGACT, distance 18, TAAAAT; TTGACA, distance 17, TATAAA; TTGTCT, distance 17, TATAAT. In the remaining sequence, TTGCCA and TATTAT are too far apart.]

One can make the models more complex: structured models

an ordered collection of p boxes, p maximum rates of differences, p − 1 intervals of distances (between successive boxes in the collection)

[Figure: occurrences of a two-box model, quorum = 3/4. Model: TTGACA (e1 = 2), d = 17 ± 1, TATAAT (e2 = 1). Occurrences: TTGACT, distance 18, TAAAAT; TTGACA, distance 16, TATAAA; TTGTCT, distance 17, TATAAT. In the remaining sequence, TTGCCA and TATTAT are too far apart.]
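An occurrence of a two-box structured model can be tested directly from its definition. The sketch below assumes that the distance is the number of letters between the end of the first box and the start of the second, and that only substitutions are allowed inside a box; the function names and the example sequence are hypothetical.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def structured_occurrences(seq, box1, e1, box2, e2, d_min, d_max):
    """Return (start1, distance, start2) for every pair of windows that
    matches box1 with at most e1 and box2 with at most e2 substitutions,
    separated by a distance in [d_min, d_max]."""
    hits = []
    k1, k2 = len(box1), len(box2)
    for s1 in range(len(seq) - k1 + 1):
        if mismatches(seq[s1:s1 + k1], box1) > e1:
            continue
        for d in range(d_min, d_max + 1):
            s2 = s1 + k1 + d
            if s2 + k2 > len(seq):
                break
            if mismatches(seq[s2:s2 + k2], box2) <= e2:
                hits.append((s1, d, s2))
    return hits


# Hypothetical sequence: TTGACT, then 18 letters, then TAAAAT.
seq = "CC" + "TTGACT" + "A" * 18 + "TAAAAT" + "GG"
print(structured_occurrences(seq, "TTGACA", 2, "TATAAT", 1, 16, 18))
print(structured_occurrences(seq, "TTGACA", 2, "TATAAT", 1, 14, 16))
```

With the interval 17 ± 1 the pair is reported; with a narrower interval the second box is "too far" and nothing is found.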

A few applications

"Experimental" set
  Escherichia coli: 441 sequences, 35115 nucleotides
  Bacillus subtilis: 131 sequences, 13099 nucleotides

"Genomic" set
  Escherichia coli: 1062 sequences, 196736 nucleotides
  Bacillus subtilis: 1148 sequences, 226928 nucleotides

"Experimental" set - MEME

Escherichia coli

MOTIF 1    width = 46    sites = 185.2
Information content: 10.0 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  AAATAAAAGTTGACATTTTTTGGAGTAAATGGTATAATGCGCCCCC
  lower levels (column alignment lost): CTTATTTCT TGACAACGCGCCCAATTTGTT A C T CGGGGA / C CTA C CACGAATGTCCGCC A A T GGC A C T C

"Experimental" set - MEME

Bacillus subtilis

MOTIF 1    width = 30    sites = 121.0
Information content: 11.6 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  TTGACATTATTTTAAAAATATGATATAATA
  lower levels (column alignment lost): TTATAATAAAATTTTGT G A G / C CC AG T

"Experimental" set - Combinatorial algorithm (1 box)

Family     Escherichia coli                    Bacillus subtilis
Family 1   ATAATGCGG    34    3.90   24        TATAATA    94   48.06   32
           TATAATGCGC   23    1.60   19        GTATAAT    74   34.34   24
           ATAATGCGC    30    5.75   17        TGTTATA    66   34.96   15
           TGTGTATA     47   15.85   16        TTTTACA    76   45.96   13
           ACAATGCGC    24    3.85   15        ATAATAT    82   52.52   13
Family 2   GTTGACAC     36   10.80   14        GTGACA     68   39.76   12
           TCACACTT     36   11.10   13        TTTACAA    75   48.56   10
           TGACACTT     38   12.35   13        GTTGAC     66   40.10   10
           GCTGACA      64   31.55   12        TTGACA     92   66.34   10
           ACACTTAT     41   14.95   12        ATGATA     10   80.26   10
           TTGACACT     37   13.75   11
Family 3   TTACGCTG     39   12.80   14
           TGTTACGC     39   14.45   12
           TTTACGCT     44   17.85   11
Family 4   TTTTTTTTTC   23    5.40   11
Family 5   GCGCCCC      44   18.85   10

"Experimental" set - Combinatorial algorithm (2 boxes)

Escherichia coli, Bacillus subtilis

[Figure: two panels (a, b) plotting χ² against the distances between two parts of a model, for distance intervals from [4,6] to [24,26]. Labelled models: TTATTC_TATAAT, TTGACT_ATAATG, TTGACA_TATAAT, TTGACT_TAAAAT.]

"Genomic" set - MEME

Escherichia coli

MOTIF 1    width = 30    sites = 111.4
Information content: 12.3 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  AATTTTAAATTGTGATCTAAATCACATATT
  lower levels (column alignment lost): CGAAGATTTA C AGTGT G ATAA / G G TAGT G

"Genomic" set - MEME

Escherichia coli

MOTIF 2    width = 39    sites = 128.7
Information content: 12.0 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  TAATTAATATACACAATTTTTTTTTTATTTTCATGATTT
  lower levels (column alignment lost): AC AATTATCTAGTTAAAACAAGAATAAAAT TCAAA / C C CGTA C GG G A C T C C

"Genomic" set - MEME

Escherichia coli

MOTIF 3    width = 12    sites = 181.4
Information content: 6.2 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  CGCCCTGTTTGC
  lower levels (column alignment lost): T GACTCCGTG / AGG ACT

"Genomic" set - MEME

Bacillus subtilis

MOTIF 1    width = 12    sites = 308.7
Information content: 5.6 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  AAAAAAAGGAGG
  lower levels (column alignment lost): TGG ACGAA / CT T T

"Genomic" set - MEME

Bacillus subtilis

MOTIF 2    width = 22    sites = 54.4
Information content: 8.3 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  GGCAGCAGCCCGTGCAGAGCGA
  lower levels (column alignment lost): C T C AAA GAATACCGAG / G T A CTAAC

"Genomic" set - MEME

Bacillus subtilis

MOTIF 3    width = 43    sites = 173.2
Information content: 11.3 bits
[Figure: information content per position, in bits (scale 0.0 to 2.2); the plot is not recoverable.]
Multilevel consensus sequence
  TTTTTTCATAATTTTTTTTTTTTTCTTTTTTTATTTAATATTT
  lower levels (column alignment lost): CCCCCAACACCAACCACACACCCTCA ACCTCAACTATAGA / CTT TT CA G C G T C C

"Genomic" set - Combinatorial algorithm (1 box)

Family     Escherichia coli                 Bacillus subtilis
Family 1   CCTGAC    573   424.60   39      TATGATA    627   407.05   91
           CTGACG    587   439.70   38      TATCATA    615   403.00   84
           CTGACA    701   557.00   36      TATAATAA   445   277.95   58
           TCCTGA    671   538.70   30      TTATTATA   439   273.85   57
           CCCTGA    575   446.80   28      TACTATA    491   325.70   54
Family 2   GTCAGG    576   412.10   47      ATGATAA    617   477.10   36
           TGTCAG    702   555.00   37      ATGAGAA    500   377.15   29
           CATCAG    711   574.60   32      TGAGAAA    520   417.85   19
           CGTCAG    580   443.60   32
           ATCAGG    689   553.30   32
Family 3   TTTTCTG   553   419.20   31      TGACAAA    510   405.20   21
           CTCTTTT   464   348.50   25
           TTTCTGT   469   357.05   23
           TTTTCAG   531   416.40   23
           CTGATTT   498   384.60   23
Family 4   CAGAAAA   539   410.55   29      CCTTTTT    638   413.05   95
           CTGAAAA   525   407.75   24      CCTTTTC    496   291.10   84
           GAGAAAA   460   359.50   19      CTCTTTT    600   391.80   81
           AGATAAA   512   415.60   16      CTTTTCT    613   410.90   77
           GTGAAAA   509   414.75   16      CTTTTTC    652   451.20   76
etc.

"Genomic" set - Combinatorial algorithm (2 boxes)

Escherichia coli, Bacillus subtilis

[Figure: two panels (a, b) plotting χ² against the distances between two parts of a model, for distance intervals from [4,6] to [24,26]. Labelled models: TTGACA_TATAAT, GAAAAA_TTTTTC, ATTGAC_TATAAT, TTGTGA_TCACAT, TGTGAT_ACATTT, TGTGAT_TCACAT.]

"Noise" in the data

Approaches using a statistical conservation measure
  Do not select an occurrence in a sequence if the score obtained is below a given threshold for all p

Approaches using a deterministic conservation measure
  Quorum

Variable length of the motifs

Approaches using a statistical conservation measure
  Problem: the relative entropy is always positive and can only increase
  Two possible solutions
    Normalize the entropy by the matrix length
    Estimate a "p-value"

Approaches using a deterministic conservation measure
  No problem

Various different families of motifs in the same sequence data set

Approaches using a statistical conservation measure
  Various matrices are kept

Approaches using a deterministic conservation measure
  No problem (on the contrary)

"Too many" motifs found by the approaches using a deterministic conservation measure

A posteriori statistical evaluation of the motifs found
  Careful! Different in general from the statistics employed by Gibbs
  A priori probability of a given motif (word or set of words)
  Same probability but estimated by simulation
  Application of methods such as Gibbs on the motifs initially found by an exhaustive search
  Comparison with what is observed on "counter-example data"
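One way to estimate such a probability by simulation is sketched below. It rests on assumptions that are not from the talk: exact occurrences only, a null model obtained by shuffling the letters of each sequence (which preserves its composition), and the number of sequences containing the word as the statistic. The names and example sequences are hypothetical.

```python
import random


def count_supporting(seqs, word):
    """Number of sequences that contain at least one exact occurrence."""
    return sum(word in s for s in seqs)


def simulated_p_value(seqs, word, trials=1000, seed=0):
    """Fraction of shuffled data sets in which the word is supported by
    at least as many sequences as in the real data (add-one corrected)."""
    rng = random.Random(seed)
    observed = count_supporting(seqs, word)
    at_least = 0
    for _ in range(trials):
        shuffled = []
        for s in seqs:
            letters = list(s)
            rng.shuffle(letters)
            shuffled.append("".join(letters))
        if count_supporting(shuffled, word) >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)


seqs = ["ACGTTGACAGT", "TTGACACCGTA", "GGATTGACATC"]
print(count_supporting(seqs, "TTGACA"))
print(simulated_p_value(seqs, "TTGACA", trials=200))
```

A small value means the motif is supported far more often in the real sequences than in composition-matched random ones.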

Other constraints

Palindromic or repeated motifs
  Quite a few approaches may consider such motifs

Position in relation to a biological landmark in the sequence
  Some approaches (van Helden et al., NAR, 28:1808-1818, 2000 in particular) take this into account (during the identification step or at printing time)

New approaches

Inference from a set of phylogenetically related sequences ("Phylogenetic footprinting")
  Simple way of constructing a set of molecular sequences that is reduced in size and potentially contains less "noise"
  Motif conservation measures which take into account the phylogeny of the organisms (Blanchette et al., ISMB 2000, http://ismb00.sdsc.edu/technical-program.html)

Phylogenetic footprinting - Main idea

A set of phylogenetically related sequences

[Figure: a phylogenetic tree whose leaves are sequences such as ...AACG......AATG... and ...TACG......TTCG..., with the words TTCG, ATCG, AACG, ATGG and TTCG placed on its nodes and the number of changes (0 or 1) marked on each branch; total: 5.]

Phylogenetic footprinting - A hint of the difficulties

possibly evolutionarily unrelated sequences

"our" motifs (motifs)

[Figure: a star tree over the words TATA, AAAT, AAAT, AATA, AAAT and TAAA.]

"ancestor" of a star-tree (but not the most parsimonious)

Phylogenetic footprinting - A hint of the difficulties

evolutionarily related sequences (orthologs)

the motifs we should seek
  motif = "ancestor" of the "true" evolutionary tree (under parsimony) for the species concerned

Phylogenetic footprinting - A hint of the difficulties

evolutionarily related sequences (orthologs)
plus other sequences that are evolutionarily related in a different way (paralogs)

the motifs we should seek: ??

Phylogenetic footprinting - A hint of the difficulties

evolutionarily related sequences (orthologs)
plus other sequences that are evolutionarily related in a different way (paralogs)
plus evolutionarily unrelated sequences

the motifs we should seek: ????
  how to model such "multi-dimensional conservation"?

A special use of phylogenetic footprinting - Gene finding by "pure homology"

A very elementary view of a eukaryotic gene

[Figure: a gene drawn from 5' to 3': 5'UTR, start codon, exons separated by introns (each intron begins with GT, the donor site, and ends with AG, the acceptor site), stop codon, 3'UTR. Splicing removes the introns; gene → protein.]

3 nucleotides (codon) → 1 amino acid

5'UTR and 3'UTR: transcribed into RNA but not translated

Gene finding - A few generalities

Detection by signal
  Promoter sequence (very difficult)
  Splicing (donor and acceptor) sites
  PolyA signal

Detection by difference of composition
  The most common: different k-mer counts (often k = 6)

Detection by homology with "known" (stored in a database) sequence
  (Observe homology is "stronger" than similarity)
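Composition-based detection starts from k-mer counts. The sketch below only shows the counting step, on a toy string with k = 3; how the counts of coding and non-coding regions are then compared is not specified here, and the function name is illustrative.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Counts of all overlapping words of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_counts("ATGATGA", k=3)
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```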

A complicated case of gene finding - "Orphan genes"

An orphan gene is a gene for which no homology (in the sense here of "strong enough" similarity) has been detected with the sequences stored in the databases

Main idea around the problem

An orphan may have "parents", that is, other orphans like itself which are its HOMOLOGS (having a common ancestor)
  ORTHOLOGS: possibly more similar
  PARALOGS: possibly having diverged more
  in both cases, having possibly different gene structures (i.e. a different number of exons)

Additional hypothesis
(important, but is it always justified?)

Exons are "better conserved" or, more accurately, "differently conserved" than introns or 5'UTR or 3'UTR or intergenic sequence

Flavour of method

Find structure by comparing orphans which are homologs, using as information only the bare essentials (method by "pure homology")
  Using a dynamic programming approach with a few twists
    Sequences are composed of coding and non-coding regions
    There are two potential types of "errors"
      Nature's (gaps, substitutions)
      Man's (sequencing reading errors = "frameshifts")

Utopia (Blayo et al., accepted TCS)

Objective

Find the best assembly of exons which satisfies the "bare essentials" gene model,
where "best" means the highest scoring assemblage of exons

Why a combinatorial, "bare essentials" type of approach?

It does not substitute for other approaches, statistical ones in particular
It cannot (perhaps) even compete with them (it was not meant to)

BUT
  It is GENERIC
  It allows, indeed obliges, us to think over again our notions of "conservation" and, in particular, of the non-conservation of non-coding regions
  It is independent of what can be learned from specific characteristics of known examples of genes

Preliminary applications (1)

13 ADH protein genes of plants (among them Arabidopsis thaliana)
  dicotyledons and monocotyledons
  one paralog and 12 orthologs
  of different gene structures and lengths
    5 to 10 exons
    from 942 bp to 1046 bp

[Figure: X12733 (9 exons) compared with 11 related sequences (D84240, M36469, M59082, U36586, U53701, U63931, U65972, X02915, X04050, X54106, Z24755) with pam120, intronIndel 20; the annotation ('annot') is plotted alongside. Specificity: 97%. Sensitivity: 98%.]

Sensitivity and specificity

Sensitivity
  $\text{sensitivity} = \frac{\text{number of correctly predicted items}}{\text{number of actual items}} = \frac{TP}{TP + FN}$

Specificity
  $\text{specificity} = \frac{\text{number of correctly predicted items}}{\text{number of predicted items}} = \frac{TP}{TP + FP}$
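The two ratios translate directly into code. The counts in the example are made up for illustration; note that the specificity defined here, TP / (TP + FP), is the quantity often called precision elsewhere.

```python
def sensitivity(tp, fn):
    """Correctly predicted items divided by actual items: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Correctly predicted items divided by predicted items: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts: 98 correct predictions, 2 missed items,
# 3 predictions that do not correspond to an actual item.
print(sensitivity(tp=98, fn=2))
print(specificity(tp=98, fp=3))
```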

Preliminary applications (2)

7 genes from a multigene family in Arabidopsis thaliana
  of unknown function
  going by the name of MYST
  of different gene structures and lengths
    13 to 15 exons
    from 1848 bp to 2040 bp

[Figure: MYST1 (15 exons) compared with 6 related sequences (MYST2, MYST3, MYST4, MYST5, MYST6, MYST7) with pam120, intronIndel 20; the annotation ('annot') is plotted alongside. Specificity: 93%. Sensitivity: 97%.]

[Figure: MYST2 (13 exons) compared with 6 related sequences (MYST1, MYST3, MYST4, MYST5, MYST6, MYST7) with pam120, intronIndel 20; the annotation ('annot') is plotted alongside. Specificity: 93%. Sensitivity: 98%.]

Main idea (currently being explored)

Putting together motif inference and gene detection by multiple comparison