motif -...
TRANSCRIPT

Motif inference

Dispersed repeat motifs, or motifs common to a set of strings

Motif search vs. motif inference

Search
  a known motif + a text
  => positions in the text where the motif is "found"

Inference
  a set of properties + a text
  => motifs satisfying the properties

Motif search (very briefly)

What is the best way of representing a motif?
  pattern
  position weight matrix or profile HMM
  30-40% false negatives
  45-60% false positives
  neural networks
  better: more examples

Example of a position weight matrix: positions 3 to 9 of the CRP binding site

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Corresponding frequency and log-likelihood position weight matrices

Frequency matrix
A   0.35    0.043   0       0.043   0.13    0.83    0.26
C   0.17    0.087   0.043   0.043   0       0.043   0.3
G   0.13    0       0.78    0       0.83    0.043   0.17
T   0.35    0.87    0.17    0.91    0.043   0.087   0.26

Log-likelihood position weight matrix
A   0.48    -2.5    -∞      -2.5    -0.94   1.7     0.061
C   -0.52   -1.5    -2.5    -2.5    -∞      -2.5    0.28
G   -0.94   -∞      1.6     -∞      1.7     -2.5    -0.52
T   0.48    1.8     -0.52   1.9     -2.5    -1.5    0.061
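
As a sketch of how the two matrices can be derived from the 23 aligned sites of the CRP example, the Python below computes per-position frequencies and their log2 ratio against a background. The uniform background of 0.25 is an assumption (it is not stated on the slide, although it reproduces the values shown), and the function names are illustrative only.

```python
import math

# The 23 aligned sites (positions 3 to 9 of the CRP binding site).
SITES = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG", "ATGCAAA",
    "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT", "ACGTGAT", "ATGTGAG",
    "TTGTGAG", "CTGTAAC", "CTGTGAA", "TTGTGAC", "GCCTGAC", "TTGTGAT",
    "TTGTGAT", "GTGTGAA", "CTGTGAC", "ATGAGAC", "TTGTGAG",
]


def frequency_matrix(sites, alphabet="ACGT"):
    """Relative frequency of each letter at each aligned position."""
    n = len(sites)
    width = len(sites[0])
    return {
        a: [sum(s[i] == a for s in sites) / n for i in range(width)]
        for a in alphabet
    }


def log_odds_matrix(freqs, background=0.25):
    """log2(frequency / background); a zero frequency gives -infinity.

    The uniform background is an assumption, not taken from the slides.
    """
    return {
        a: [math.log2(f / background) if f > 0 else float("-inf") for f in row]
        for a, row in freqs.items()
    }


def score(word, log_odds):
    """Sum of the matrix entries selected by the letters of `word`."""
    return sum(log_odds[c][i] for i, c in enumerate(word))
```

Scoring a candidate word then amounts to `score("TTGTGAT", log_odds_matrix(frequency_matrix(SITES)))`; any word using a letter never observed at a position scores minus infinity, which is why pseudocounts are often added in practice.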
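
Example of a profile HMM

[Figure: profile HMM with begin (b) and end (e) states, match states m1 to m3, insert states i0 to i3 and delete states d1 to d3. Residues shown with the three match positions: 1: CCCCC, 2: AGDVK, 3: FWYFY.]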

Different HMM or HMM-related architectures

[Figure: four state-chain diagrams, labelled BLOCKS, META-MEME, profile HMM and HMMER2 "Plan 7".]

Motif inference: set of properties

Main property: motif of interest = "conserved" element

Various possible measures for "conservation"
  conservation at the sequence level?
  conservation at the level of physico-chemical properties of the nucleotide sequences?

In this talk:

"letter" conservation
  TATAAT
  TTGNCA
  RFXCP

physico-chemical conservation
  run of pyrimidines
  run of hydrophilic aa

or a mixture of both
  TA[AT]N[AT]T
  [ILMV][ASG]XXC[ILMV]H[FYW]P

"Statistical" conservation measure

GTTTTTCT
CTGCATCT
GTGTAACC
GGGTATGT
TTGTCTCT
GCTTATCT
ATGTCTCT
GAGTATCA
GTGTAGGT
GTGAATCA

A   1   1   0   1   7   1   0   2
C   1   1   0   1   2   0   8   1
G   7   1   8   0   0   1   2   0
T   1   7   2   8   1   8   0   7

"Most surprising" sets of words

\sum_{i=1}^{L} \sum_{\alpha \in \Sigma} f_{i\alpha} \log_2 \frac{f_{i\alpha}}{f_\alpha}    (relative entropy)

This is a weighted average of the log-likelihood (the weights are the frequencies).

In each example below, a row holds the letters found at one position i of the aligned words, followed by the contribution of that position to the sum.

Example 1: f_\alpha = 1/4, total = 0
A G T C    0
G A C T    0
T G C A    0
C G A T    0
G C A T    0

Example 2: f_\alpha = 1/4, total = 10
A A A A    2
T T T T    2
C C C C    2
G G G G    2
C C C C    2

Example 3: f_A = 1/16, total = 20
A A A A    4
A A A A    4
A A A A    4
A A A A    4
A A A A    4

Example 4: f_A = 3/4, total = 2
A A A A    0.4
A A A A    0.4
A A A A    0.4
A A A A    0.4
A A A A    0.4
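
The relative entropy of the four worked examples can be checked with a few lines of Python. This is a minimal sketch: the function name and the representation of the input (one string per alignment position, as on the slides) are choices made here for illustration.

```python
import math
from collections import Counter


def relative_entropy(columns, background):
    """Sum over positions i and letters a of f_ia * log2(f_ia / f_a).

    `columns` holds one string per alignment position (the letters of all
    words at that position); `background` maps each letter to f_a.
    """
    total = 0.0
    for column in columns:
        n = len(column)
        for letter, count in Counter(column).items():
            f = count / n
            total += f * math.log2(f / background[letter])
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Every letter appears once per position: nothing surprising.
mixed = relative_entropy(["AGTC", "GACT", "TGCA", "CGAT", "GCAT"], uniform)
# Each position is perfectly conserved: 2 bits per position.
conserved = relative_entropy(["AAAA", "TTTT", "CCCC", "GGGG", "CCCC"], uniform)
# Runs of A are very surprising when A is rare, and hardly at all when A is common.
rare_a = relative_entropy(["AAAA"] * 5, {"A": 1 / 16})
common_a = relative_entropy(["AAAA"] * 5, {"A": 3 / 4})
```

The last two cases show why the background frequencies matter: the same five conserved positions score 20 when A is rare and about 2 when A is common.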

"Deterministic" conservation measure

"Model": GTGTATCT

Differences   Word
2             GTTTTTCT
2             CTGCATCT
2             GTGTAACC
2             GGGTATGT
2             TTGTCTCT
2             GCTTATCT
2             ATGTCTCT
2             GAGTATCA
2             GTGTAGGT
2             GTGAATCA

Model

A motif: a word written over the same alphabet as the text, or over a degenerate (physico-chemical) alphabet

A number of specific properties
  minimum number of occurrences the motif must have (quorum)
  for each occurrence, maximum number of differences allowed in relation to the motif (substitutions only, or substitutions and indels)
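
The two properties of the model (quorum and maximum number of differences) can be illustrated on the ten words of the "Deterministic" conservation measure slide. The sketch below assumes substitutions only (Hamming distance) and counts a sequence once however many occurrences it contains; the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def occurrences(model, sequences, max_subs):
    """Map sequence index -> start positions where `model` occurs with
    at most `max_subs` substitutions (no indels in this sketch)."""
    k = len(model)
    hits = {}
    for idx, seq in enumerate(sequences):
        positions = [
            p for p in range(len(seq) - k + 1)
            if hamming(model, seq[p:p + k]) <= max_subs
        ]
        if positions:
            hits[idx] = positions
    return hits


def satisfies_quorum(model, sequences, max_subs, quorum):
    """True if the model occurs in at least `quorum` of the sequences."""
    return len(occurrences(model, sequences, max_subs)) >= quorum


WORDS = [
    "GTTTTTCT", "CTGCATCT", "GTGTAACC", "GGGTATGT", "TTGTCTCT",
    "GCTTATCT", "ATGTCTCT", "GAGTATCA", "GTGTAGGT", "GTGAATCA",
]
```

With the model GTGTATCT, `satisfies_quorum("GTGTATCT", WORDS, 2, 10)` holds, whereas with at most one substitution no word is an occurrence at all.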

In fact, the two are not so different

Except perhaps for:

CTGTATCG
CTGATTCG
CTGAGACG
GTGCATCG
CTCGCTCG
CTGCGTCG
CTGTCTCG
CTGCTTCG
CTGTCTCG
CTGGATCG

A   0    0    0   2   3   1   0    0
C   9    0    1   3   3   0   10   0
G   1    0    9   2   2   0   0    10
T   0    10   0   3   2   9   0    0

Which may, at the current time, perhaps be better captured by:

"Model": CTGNNTCG

Differences   Word
0             CTGTATCG
0             CTGATTCG
1             CTGAGACG
1             GTGCATCG
1             CTCGCTCG
0             CTGCGTCG
0             CTGTCTCG
0             CTGCTTCG
0             CTGTCTCG
0             CTGGATCG

Approaches using a statistical conservation measure

Objective
  Find the set of words that is the "most surprising possible"
  It is an optimisation problem, which in general leads to a unique solution

Algorithm
  Only approach possible: test all sets of words and, for each of them, calculate the value of the formula
  Too time consuming (O(n^N k)); one must therefore use heuristics

"Heuristic"

Three main approaches
  Expectation-Maximization (Lawrence et al., Proteins, 7:41-51, 1990)
    MEME (Bailey et al., Machine Learn., 21:51-80)
  Gibbs sampling (Lawrence et al., Sci., 262:208-214, 1993)
  Greedy algorithm
    (w)consensus (Hertz et al., Bioinfo., 15:563-577, 1999)

Gibbs sampling

For all p: value of the formula, F_p
Choose the position with max F_p, or (stochastic) choose p with probability F_p / \sum_p F_p
And we start again (with another string) until convergence
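
One update of this scheme can be sketched as follows. The sketch assumes that F_p is the relative entropy of the current set of words when the word of the chosen string starts at p, with the positions in all other strings held fixed. That reading, the function names and the clamping of scores at zero are assumptions made for illustration, not details taken from the cited papers.

```python
import math
import random
from collections import Counter


def relative_entropy(words, background):
    """Relative entropy of a set of equal-length words (see the formula)."""
    n = len(words)
    total = 0.0
    for i in range(len(words[0])):
        for letter, count in Counter(w[i] for w in words).items():
            f = count / n
            total += f * math.log2(f / background[letter])
    return total


def resample_position(seqs, positions, j, k, background,
                      stochastic=True, rng=random):
    """Return a new start position for string j.

    F_p is computed for every p in string j while the positions in the
    other strings stay fixed; the result is argmax F_p, or a draw with
    probability F_p / sum_p F_p when `stochastic` is true.
    """
    trial = list(positions)
    scores = []
    for p in range(len(seqs[j]) - k + 1):
        trial[j] = p
        words = [s[q:q + k] for s, q in zip(seqs, trial)]
        # Clamp tiny negative rounding errors so the weights stay valid.
        scores.append(max(relative_entropy(words, background), 0.0))
    if stochastic and sum(scores) > 0:
        return rng.choices(range(len(scores)), weights=scores, k=1)[0]
    return max(range(len(scores)), key=scores.__getitem__)
```

A full sampler would call `resample_position` for each string in turn and stop when the positions no longer change (deterministic variant) or after a fixed number of sweeps (stochastic variant).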

Approaches using a "deterministic" conservation measure

Objective
  Given a model (alphabet for the motifs and properties such as quorum and maximum difference rate allowed), find all motifs which satisfy the properties
  It is an enumeration problem, which in general produces various (often a great number of) solutions

Algorithm
  An exhaustive approach is possible
  Time complexity depends on properties

How does the algorithm work?

It does not matter since the algorithm is exact!

However, this has in general to be followed by a STATISTICAL EVALUATION of the models found, to classify them according to how SURPRISING they are given the rest of the sequences

One can make the models more complex: motif inference with differences and a nontransitive relation

Alphabet of models corresponds to groups of amino acids

[Figure: overlapping groups of the amino acids M, F, W, Y, H, K, R, D, Q, E, N, T, L, I, V, C, S, A, G, P, plus a wild card.]

One can make the models more complex: motif inference with differences and a nontransitive relation

Example
  models written over a physico-chemical alphabet
    [AST][ILMV]XX[FYW][HKR]X[PG]C
  occurrences
    0 difference      AIAGWHAPC
    1 substitution    ATTAYHSPC
    1 deletion        SVMLFLPC
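
The three occurrences can be checked against the model with a small edit-distance computation in which a model position matches any letter of its group and X matches anything. This is a minimal sketch, assuming unit costs for substitutions, insertions and deletions and a whole-word (global) comparison; the names `MODEL`, `matches` and `differences` are illustrative.

```python
# The model [AST][ILMV]XX[FYW][HKR]X[PG]C; None stands for the wild card X.
MODEL = ["AST", "ILMV", None, None, "FYW", "HKR", None, "PG", "C"]


def matches(symbol, letter):
    """A model symbol matches a letter if it is X or its group contains the letter."""
    return symbol is None or letter in symbol


def differences(model, word):
    """Minimum number of substitutions, insertions and deletions needed
    to align the whole word with the whole model."""
    m, n = len(model), len(word)
    prev = list(range(n + 1))          # aligning an empty model prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n            # aligning an empty word prefix
        for j in range(1, n + 1):
            cost = 0 if matches(model[i - 1], word[j - 1]) else 1
            cur[j] = min(
                prev[j - 1] + cost,    # match or substitution
                prev[j] + 1,           # model position deleted from the word
                cur[j - 1] + 1,        # extra letter inserted in the word
            )
        prev = cur
    return prev[n]
```

Because group membership is tested position by position, two words can both match the same model symbol without being comparable to each other, which is one way to read the nontransitive relation mentioned in the slide title.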

One can make the models more complex: structured models

Smile (Marsan et al., JCB, 7:345-362, 2000)

An ordered collection of p boxes, p maximum rates of differences, p - 1 intervals of distances (between successive boxes in the collection)

occurrences (quorum = 3/4)
  TTGACT    18         TAAAAT
  TTGACA    17         TATAAA
  TTGCCA    too far    TATTAT
  TTGTCT    17         TATAAT

model
  TTGACA (e1 = 2)    d = 17 ± 1    TATAAT (e2 = 1)

One can make the models more complex: structured models

An ordered collection of p boxes, p maximum rates of differences, p - 1 intervals of distances (between successive boxes in the collection)

occurrences (quorum = 3/4)
  TTGACT    18         TAAAAT
  TTGACA    16         TATAAA
  TTGCCA    too far    TATTAT
  TTGTCT    17         TATAAT

model
  TTGACA (e1 = 2)    d = 17 ± 1    TATAAT (e2 = 1)
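
A two-box structured model can be tested against a sequence as sketched below. Several points are assumptions made for illustration: the distance d is taken to be the number of nucleotides between the end of the first box and the start of the second, differences are substitutions only, and the example sequences use an artificial spacer of C's (with a hypothetical spacer of 25 for the "too far" case).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def find_structured(seq, box1, e1, box2, e2, dmin, dmax):
    """Return (start of box 1, spacer length) for every occurrence of the
    structured model: box1 with <= e1 substitutions, then dmin..dmax
    nucleotides, then box2 with <= e2 substitutions."""
    k1, k2 = len(box1), len(box2)
    hits = []
    for p in range(len(seq) - k1 + 1):
        if hamming(box1, seq[p:p + k1]) > e1:
            continue
        for d in range(dmin, dmax + 1):
            q = p + k1 + d             # start of the second box
            if q + k2 > len(seq):
                break
            if hamming(box2, seq[q:q + k2]) <= e2:
                hits.append((p, d))
    return hits


# Artificial sequences built from the boxes on the slide; the poly-C
# spacers and the spacer length 25 are hypothetical.
SEQS = [
    "TTGACT" + "C" * 18 + "TAAAAT",
    "TTGACA" + "C" * 16 + "TATAAA",
    "TTGCCA" + "C" * 25 + "TATTAT",   # second box too far away
    "TTGTCT" + "C" * 17 + "TATAAT",
]

supported = sum(
    bool(find_structured(s, "TTGACA", 2, "TATAAT", 1, 16, 18)) for s in SEQS
)
```

Here `supported` counts the sequences containing at least one occurrence, which is the quantity compared with the quorum (3 out of 4 in the slide's example).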

A few applications

"Experimental" set
  Escherichia coli: 441 sequences, 35115 nucleotides
  Bacillus subtilis: 131 sequences, 13099 nucleotides

"Genomic" set
  Escherichia coli: 1062 sequences, 196736 nucleotides
  Bacillus subtilis: 1148 sequences, 226928 nucleotides

"Experimental" set: MEME

Escherichia coli
MOTIF 1    width = 46    sites = 185.2
Information content: 10.0 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
AAATAAAAGTTGACATTTTTTGGAGTAAATGGTATAATGCGCCCCC

"Experimental" set: MEME

Bacillus subtilis
MOTIF 1    width = 30    sites = 121.0
Information content: 11.6 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
TTGACATTATTTTAAAAATATGATATAATA

"Experimental" set: combinatorial algorithm (1 box)

            Escherichia coli                     Bacillus subtilis
Family 1    ATAATGCGG     34    3.90   24        TATAATA    94   48.06   32
            TATAATGCGC    23    1.60   19        GTATAAT    74   34.34   24
            ATAATGCGC     30    5.75   17        TGTTATA    66   34.96   15
            TGTGTATA      47   15.85   16        TTTTACA    76   45.96   13
            ACAATGCGC     24    3.85   15        ATAATAT    82   52.52   13
Family 2    GTTGACAC      36   10.80   14        GTGACA     68   39.76   12
            TCACACTT      36   11.10   13        TTTACAA    75   48.56   10
            TGACACTT      38   12.35   13        GTTGAC     66   40.10   10
            GCTGACA       64   31.55   12        TTGACA     92   66.34   10
            ACACTTAT      41   14.95   12        ATGATA     10   80.26   10
            TTGACACT      37   13.75   11
Family 3    TTACGCTG      39   12.80   14
            TGTTACGC      39   14.45   12
            TTTACGCT      44   17.85   11
Family 4    TTTTTTTTTC    23    5.40   11
Family 5    GCGCCCC       44   18.85   10

"Experimental" set: combinatorial algorithm (2 boxes)

Escherichia coli and Bacillus subtilis

[Figure: two panels (a and b) plotting χ² against the distance between the two parts of a model, for distance intervals from [4,6] to [24,26]. Models labelled in the panels: TTATTC_TATAAT, TTGACT_ATAATG, TTGACA_TATAAT and TTGACT_TAAAAT; distances marked from 8 to 12 and from 12 to 18.]

"Genomic" set: MEME

Escherichia coli
MOTIF 1    width = 30    sites = 111.4
Information content: 12.3 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
AATTTTAAATTGTGATCTAAATCACATATT

"Genomic" set: MEME

Escherichia coli
MOTIF 2    width = 39    sites = 128.7
Information content: 12.0 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
TAATTAATATACACAATTTTTTTTTTATTTTCATGATTT

"Genomic" set: MEME

Escherichia coli
MOTIF 3    width = 12    sites = 181.4
Information content: 6.2 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
CGCCCTGTTTGC

"Genomic" set: MEME

Bacillus subtilis
MOTIF 1    width = 12    sites = 308.7
Information content: 5.6 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
AAAAAAAGGAGG

"Genomic" set: MEME

Bacillus subtilis
MOTIF 2    width = 22    sites = 54.4
Information content: 8.3 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
GGCAGCAGCCCGTGCAGAGCGA

"Genomic" set: MEME

Bacillus subtilis
MOTIF 3    width = 43    sites = 173.2
Information content: 11.3 bits
[Figure: per-position information content in bits (scale 0.0 to 2.2).]
Multilevel consensus sequence (first level):
TTTTTTCATAATTTTTTTTTTTTTCTTTTTTTATTTAATATTT

"Genomic" set: combinatorial algorithm (1 box)

            Escherichia coli                   Bacillus subtilis
Family 1    CCTGAC     573   424.60   39       TATGATA     627   407.05   91
            CTGACG     587   439.70   38       TATCATA     615   403.00   84
            CTGACA     701   557.00   36       TATAATAA    445   277.95   58
            TCCTGA     671   538.70   30       TTATTATA    439   273.85   57
            CCCTGA     575   446.80   28       TACTATA     491   325.70   54
Family 2    GTCAGG     576   412.10   47       ATGATAA     617   477.10   36
            TGTCAG     702   555.00   37       ATGAGAA     500   377.15   29
            CATCAG     711   574.60   32       TGAGAAA     520   417.85   19
            CGTCAG     580   443.60   32
            ATCAGG     689   553.30   32
Family 3    TTTTCTG    553   419.20   31       TGACAAA     510   405.20   21
            CTCTTTT    464   348.50   25
            TTTCTGT    469   357.05   23
            TTTTCAG    531   416.40   23
            CTGATTT    498   384.60   23
Family 4    CAGAAAA    539   410.55   29       CCTTTTT     638   413.05   95
            CTGAAAA    525   407.75   24       CCTTTTC     496   291.10   84
            GAGAAAA    460   359.50   19       CTCTTTT     600   391.80   81
            AGATAAA    512   415.60   16       CTTTTCT     613   410.90   77
            GTGAAAA    509   414.75   16       CTTTTTC     652   451.20   76
etc.

"Genomic" set: combinatorial algorithm (2 boxes)

Escherichia coli and Bacillus subtilis

[Figure: two panels (a and b) plotting χ² against the distance between the two parts of a model, for distance intervals from [4,6] to [24,26]. Models labelled in the panels: TTGACA_TATAAT, GAAAAA_TTTTTC, ATTGAC_TATAAT, TTGTGA_TCACAT, TGTGAT_ACATTT and TGTGAT_TCACAT.]

"Noise" in the data

Approaches using a statistical conservation measure
  Do not select an occurrence in a sequence if the score obtained is below a given threshold for all p

Approaches using a deterministic conservation measure
  Quorum

Variable length of the motifs

Approaches using a statistical conservation measure
  Problem: the relative entropy is always positive and can only increase
  Two possible solutions
    Normalize the entropy by the matrix length
    Estimate a "p-value"

Approaches using a deterministic conservation measure
  No problem

Several different families of motifs in the same sequence dataset

Approaches using a statistical conservation measure
  Several matrices are kept

Approaches using a deterministic conservation measure
  No problem (on the contrary)

"Too many" motifs found by the approaches using a deterministic conservation measure

A posteriori statistical evaluation of the motifs found
  Careful! Different in general from the statistics employed by Gibbs
  A priori probability of a given motif (word or set of words)
  Same probability but estimated by simulation
  Application of methods such as Gibbs on the motifs initially found by an exhaustive search
  Comparison with what is observed on "counter-example data"

Other constraints

Palindromic or repeated motifs
  Quite a few approaches may consider such motifs

Position in relation to a biological landmark in the sequence
  Some approaches (van Helden et al., NAR, 28:1808-1818, 2000 in particular) take this into account (during the identification step or at printing time)

New approaches

Inference from a set of phylogenetically related sequences ("Phylogenetic footprinting")
  Simple way of constructing a set of molecular sequences that is reduced in size and potentially contains less "noise"
  Motif conservation measures which take into account the phylogeny of the organisms (Blanchette et al., ISMB 2000, http://ismb00.sdsc.edu/technical-program.html)
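
Phylogenetic footprinting: main idea

A set of phylogenetically related sequences

[Figure: a tree over the sequences, two of which are shown as ...AACG......AATG... and ...TACG......TTCG...; nodes are labelled TTCG, ATCG, AACG, ATGG and TTCG, and the edges carry the costs 1, 1, 1, 0, 0, 1, 1, 0; total: 5.]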

Phylogenetic footprinting: a hint of the difficulties

possibly evolutionarily unrelated sequences

"our" motifs (motifs)
  TATA   AAAT   AAAT   AATA   AAAT

TAAA    "ancestor" of a star-tree (but not the most parsimonious)
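
The remark about parsimony can be made concrete with a small computation. The sketch assumes a reading of the figure that the extracted text does not fully confirm: that the five words TATA, AAAT, AAAT, AATA, AAAT are the leaves of a star-tree and that TAAA is the "ancestor" shown at its centre. Under that assumption, the cost of an ancestor is the sum of its Hamming distances to the leaves, and the most parsimonious ancestor takes the most frequent letter at each position.

```python
from collections import Counter


def star_tree_cost(ancestor, leaves):
    """Total number of substitutions on a star-tree rooted at `ancestor`."""
    return sum(
        sum(x != y for x, y in zip(ancestor, leaf)) for leaf in leaves
    )


def most_parsimonious_ancestor(leaves):
    """Column-wise most frequent letter (ties go to the first letter seen)."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*leaves)
    )


# Assumed leaves of the star-tree (see the hedge above).
LEAVES = ["TATA", "AAAT", "AAAT", "AATA", "AAAT"]

best = most_parsimonious_ancestor(LEAVES)
```

Under this reading, TAAA costs 9 substitutions while the column-wise majority word costs 5, which is consistent with the slide's "(but not the most parsimonious)".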

Phylogenetic footprinting: a hint of the difficulties

evolutionarily related sequences (orthologs)
the motifs we should seek
motif: "ancestor" of the "true" evolutionary tree (under parsimony) for the species concerned

Phylogenetic footprinting: a hint of the difficulties

evolutionarily related sequences (orthologs)
the motifs we should seek ??
plus other evolutionarily related (in a different way) sequences (paralogs)

Phylogenetic footprinting: a hint of the difficulties

evolutionarily related sequences (orthologs)
the motifs we should seek ????
how to model such "multi-dimensional conservation"?
plus other evolutionarily related (in a different way) sequences (paralogs)
plus evolutionarily unrelated sequences

A special use of phylogenetic footprinting: gene finding by "pure homology"

A very elementary view of a eukaryotic gene

[Figure: gene drawn from 5' to 3': 5' UTR, start codon, exons separated by introns, stop codon, 3' UTR. Introns begin with GT (donor site) and end with AG (acceptor site); splicing leads from gene to protein.]

3 nucleotides (codon) -> 1 amino acid
5' UTR and 3' UTR: transcribed into RNA but not translated

Gene finding: a few generalities

Detection by signal
  Promoter sequence (very difficult)
  Splicing (donor and acceptor) sites
  PolyA signal

Detection by difference of composition
  The most common: different k-mer counts (often k = 6)

Detection by homology with "known" (stored in a database) sequence
  (Observe that homology is "stronger" than similarity)
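
Detection by difference of composition starts from k-mer counts such as the following. This is only a sketch of the counting step (how coding and noncoding counts are then compared is not covered by the slide), and the function name is illustrative.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer of `seq` (hexamers by default)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_counts("ACGTACGT", k=4)
```

Comparing such count profiles between regions known to be coding and regions known to be noncoding is what gives the method its discriminating power.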

A complicated case of gene finding: "orphan genes"

An orphan gene is a gene for which no homology (in the sense here of "strong enough" similarity) has been detected with the sequences stored in the databases

Main idea around the problem

An orphan may have "parents", that is, other orphans like itself which are its HOMOLOGS (having a common ancestor)
  ORTHOLOGS, possibly more similar
  PARALOGS, possibly having diverged more
  in both cases, possibly having different gene structures (i.e. a different number of exons)

Additional hypothesis (important, but is it always justified?)

Exons are "better conserved" or, more accurately, "differently conserved" than introns, 5' UTR, 3' UTR or intergenic sequence

Flavour of method

Find structure by comparing orphans which are homologs, using as information only the bare essentials (method by "pure homology")

Using a dynamic programming approach with a few twists
  Sequences are composed of coding and noncoding regions
  There are two potential types of "errors"
    Nature's (gaps, substitutions)
    Man's (sequencing reading errors = "frameshifts")

Utopia (Blayo et al., accepted TCS)

Objective

Find the best assembly of exons which satisfies the "bare essentials" gene model, where "best" means highest scoring assemblage of exons

Why a combinatorial, "bare essentials" type of approach?

It does not substitute for other, statistical in particular, approaches
It cannot (perhaps) even compete with them (it was not meant to)

BUT
  It is GENERIC
  It allows, indeed obliges, us to think over again our notions of "conservation" and, in particular, of the non-conservation of non-coding regions
  It is independent of what can be learned from specific characteristics of known examples of genes

Preliminary applications (1)

13 ADH protein genes of plants (among them Arabidopsis thaliana)
  dicotyledons and monocotyledons
  one paralog and 12 orthologs
  of different gene structures and lengths
    5 to 10 exons
    from 942 bp to 1046 bp

[Figure: plot for X12733 (9 exons), horizontal axis from 0 to 2500, vertical axis from -4500 to 0, with one curve for the annotation ('annot') and one for each of D84240, M36469, M59082, U36586, U53701, U63931, U65972, X02915, X04050, X54106 and Z24755.]
X12733 compared with 11 related sequences with pam120, intronIndel 20. Specificity: 97%. Sensitivity: 98%.

Sensitivity and specificity

Sensitivity
  \text{sensitivity} = \frac{\text{number of correctly predicted items}}{\text{number of actual items}} = \frac{TP}{TP + FN}

Specificity
  \text{specificity} = \frac{\text{number of correctly predicted items}}{\text{number of predicted items}} = \frac{TP}{TP + FP}
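
Both measures can be computed from annotated and predicted exons as sketched below. The sketch assumes that the "items" are individual nucleotide positions and that exons are given as half-open (start, end) intervals; both are choices made here for illustration, since the slides do not say at which level the percentages were computed. Note that specificity as defined on the slide, TP / (TP + FP), is the quantity often called precision elsewhere.

```python
def covered_positions(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {p for start, end in intervals for p in range(start, end)}


def sensitivity_specificity(actual, predicted):
    """Return (TP / (TP + FN), TP / (TP + FP)) at the position level.

    Raises ZeroDivisionError if either set of intervals is empty.
    """
    a = covered_positions(actual)
    p = covered_positions(predicted)
    tp = len(a & p)    # annotated and predicted
    fn = len(a - p)    # annotated but missed
    fp = len(p - a)    # predicted but not annotated
    return tp / (tp + fn), tp / (tp + fp)
```

For instance, an annotated exon (0, 100) predicted as (10, 110) gives a sensitivity and a specificity of 0.9 each, while predicting only the first of two equally long annotated exons gives a sensitivity of 0.5 with a specificity of 1.0.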

Preliminary applications (2)

7 genes from a multigene family in Arabidopsis thaliana
  of unknown function
  going by the name of MYST
  of different gene structures and lengths
    13 to 15 exons
    from 1848 bp to 2040 bp

[Figure: plot for MYST1 (15 exons), horizontal axis from 0 to 7000, vertical axis from -7000 to 0, with one curve for the annotation ('annot') and one for each of MYST2 to MYST7.]
MYST1 compared with 6 related sequences with pam120, intronIndel 20. Specificity: 93%. Sensitivity: 97%.

[Figure: plot for MYST2 (13 exons), horizontal axis from 0 to 3500, vertical axis from -7000 to 0, with one curve for the annotation ('annot') and one for each of MYST1 and MYST3 to MYST7.]
MYST2 compared with 6 related sequences with pam120, intronIndel 20. Specificity: 93%. Sensitivity: 98%.

Main idea (currently being explored)

Putting together motif inference and gene detection by multiple comparison