bp-203 foundations for mathematical biology statistics lecture...
TRANSCRIPT
BP-203 Foundations for Mathematical Biology
Statistics Lecture III
By Hao Li
Nov 8, 2001

Statistical Modeling and Inference

data collection
constructing probabilistic model
inference of model parameters
interpreting results
making new predictions

Maximum likelihood Approach

Example A: Toss a coin N times, observe m heads in a specific sequence
Model: binomial distribution
Inference: the parameter p
Prediction: e.g., how many heads will be observed for another L trials

$$P(m \mid p) = p^{m} (1-p)^{N-m}$$
Prob. of observing a specific sequence of m heads
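Find a $\hat{p}$ such that the above prob. is maximized

$$\left. \frac{\partial \log P(m \mid p)}{\partial p} \right|_{p=\hat{p}} = 0 \quad\Rightarrow\quad \hat{p} = m/N$$

$$\log P(m \mid \hat{p}) = N \left[ \hat{p} \log \hat{p} + (1-\hat{p}) \log (1-\hat{p}) \right]$$
(the bracketed term is -entropy)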

How good is the estimate?
Distribution of $\hat{p}$ under repeated sampling
Central limit theorem: distribution of m approaches normal for large N

$$m \sim Np \pm \sqrt{Np(1-p)}$$
$$\hat{p} \sim p \pm \sqrt{p(1-p)/N}$$

Thus the estimate $\hat{p}$ converges to the real p with a square-root convergence
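
The square-root convergence can be checked numerically. The sketch below is an illustration only, not part of the lecture: the value `p_true = 0.3`, the seed, and the helper names `mle_coin` and `spread_of_estimate` are assumptions chosen for the demonstration. It repeats the N-toss experiment many times and compares the observed spread of the estimate with sqrt(p(1-p)/N).

```python
import math
import random

def mle_coin(tosses):
    """Maximum likelihood estimate p_hat = m / N for a list of 0/1 tosses."""
    return sum(tosses) / len(tosses)

def spread_of_estimate(p_true, n_tosses, repeats, rng):
    """Repeat the N-toss experiment; return mean and std. dev. of p_hat."""
    estimates = []
    for _ in range(repeats):
        tosses = [1 if rng.random() < p_true else 0 for _ in range(n_tosses)]
        estimates.append(mle_coin(tosses))
    mean = sum(estimates) / repeats
    variance = sum((e - mean) ** 2 for e in estimates) / repeats
    return mean, math.sqrt(variance)

rng = random.Random(0)  # fixed seed so the run is repeatable
p_true = 0.3            # hypothetical "real" p, chosen for illustration
results = {}
for n_tosses in (100, 400, 1600):
    mean, sd = spread_of_estimate(p_true, n_tosses, 1000, rng)
    predicted = math.sqrt(p_true * (1 - p_true) / n_tosses)
    results[n_tosses] = (mean, sd, predicted)
    print(n_tosses, round(mean, 4), round(sd, 4), round(predicted, 4))
```

Each fourfold increase in N should roughly halve the observed spread, in line with the 1/sqrt(N) scaling.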

Maximum likelihood Approach

Example B: $x_1, x_2, \ldots, x_N$ independent and identically distributed (i.i.d.) sample drawn from a normal distribution $N(\mu, \sigma^2)$
Estimate the mean and the variance
Maximizing the likelihood function (show this is true in the homework)

$$\hat{\mu} = \sum_{i=1}^{N} x_i / N$$
$$\hat{\sigma}^2 = \sum_{i=1}^{N} (x_i - \hat{\mu})^2 / N$$
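
As a quick check of the two estimators, the following sketch computes them for a small made-up sample and confirms that moving either parameter away from the estimate lowers the log-likelihood. The sample values and the function names are assumptions for illustration; this does not replace the homework derivation.

```python
import math

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) with the 1/N (maximum likelihood) variance."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

def normal_log_likelihood(xs, mu, sigma2):
    """Log-likelihood of an i.i.d. normal sample with mean mu and variance sigma2."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

sample = [2.1, 1.7, 3.4, 2.9, 2.2, 1.9]  # hypothetical data
mu_hat, sigma2_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, sigma2_hat)

# Perturbing either parameter should not raise the log-likelihood.
for d_mu, d_var in ((0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)):
    assert normal_log_likelihood(sample, mu_hat + d_mu, sigma2_hat + d_var) <= best
print(mu_hat, sigma2_hat)
```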

General formulation of the maximum likelihood approach

D: observed data
M: the statistical model
$\theta$: parameters of the model
$P(D \mid M, \theta)$: probability of observing the data given the model and parameters
$L(\theta; D) \equiv P(D \mid M, \theta)$: the likelihood of $\theta$ as a function of data

Maximum likelihood estimate of the parameters
$$\hat{\theta} = \arg\max_{\theta} L(\theta; D)$$

Theorem: $\hat{\theta}$ converges to the true $\theta_0$ in the large sample limit with error $\sim 1/\sqrt{N}$
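
When no closed form is available, the arg max can be found numerically. The sketch below is an assumption-laden illustration rather than anything from the lecture: it uses a Poisson model (not discussed in the slides) because its log-likelihood has a single peak, and a plain golden-section search written here as `golden_section_max`. The numerical answer can be compared with the known closed-form estimate, the sample mean.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a single-peaked function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the peak lies in [a, d]
        else:
            a = c  # the peak lies in [c, b]
    return (a + b) / 2

def poisson_log_likelihood(theta, counts):
    """log L(theta; D) for i.i.d. Poisson counts with rate theta."""
    return sum(k * math.log(theta) - theta - math.lgamma(k + 1) for k in counts)

counts = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical observed data D
theta_hat = golden_section_max(
    lambda t: poisson_log_likelihood(t, counts), 1e-6, 20.0
)
print(theta_hat, sum(counts) / len(counts))
```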

Example C: Segmentation
a sequence of head (1) and tail (0) is generated by first using a coin with $p_1$ and then change to a coin with $p_2$
the change point unknown
Data = (001010000000010111101111100010)

$$P(\mathrm{seq}, x \mid p_1, p_2) = p_1^{m_1(x)} (1-p_1)^{x - m_1(x)} \, p_2^{m_2(x)} (1-p_2)^{N - x - m_2(x)}$$

$x$: position right before the change
$m_1(x)$: number of 1's up to x
$m_2(x)$: number of 1's after x
$N$: total number of tosses

Example C continued

For fixed $x$, maximize $P(\mathrm{seq}, x \mid p_1, p_2)$ with respect to $p_1$ and $p_2$

$$\hat{p}_1 = m_1(x)/x, \qquad \hat{p}_2 = m_2(x)/(N-x)$$

$$\log P(\mathrm{seq}, x \mid \hat{p}_1, \hat{p}_2) = x \left[ \hat{p}_1 \log \hat{p}_1 + (1-\hat{p}_1) \log(1-\hat{p}_1) \right] + (N-x) \left[ \hat{p}_2 \log \hat{p}_2 + (1-\hat{p}_2) \log(1-\hat{p}_2) \right]$$

Then maximize $P(\mathrm{seq}, x \mid \hat{p}_1, \hat{p}_2)$ with respect to $x$

The above approach is sometimes referred to as "entropic segmentation", as it tries to minimize the total entropy.
A generalization of the above model to a 4-letter alphabet and an unknown number of breaking points can be used to segment DNA sequences into regions of different composition. This is more naturally described by a hidden Markov model.
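
The two-stage maximization of Example C can be sketched directly on the 30-toss data string from the slide. This is a minimal illustration; the function names are hypothetical, and the change point is restricted to 1 <= x <= N-1 so that both segments are non-empty.

```python
import math

def block_log_likelihood(ones, length):
    """length * [p log p + (1-p) log(1-p)] evaluated at p = ones / length.

    Written in terms of counts so that p = 0 or p = 1 contributes zero.
    """
    total = 0.0
    for count in (ones, length - ones):
        if count > 0:
            total += count * math.log(count / length)
    return total

def best_change_point(seq):
    """Return (log-likelihood, x, p1_hat, p2_hat) maximizing over x."""
    n = len(seq)
    total_ones = sum(seq)
    best = None
    m1 = 0
    for x in range(1, n):  # x is the position right before the change
        m1 += seq[x - 1]   # number of 1's up to x
        m2 = total_ones - m1
        ll = block_log_likelihood(m1, x) + block_log_likelihood(m2, n - x)
        if best is None or ll > best[0]:
            best = (ll, x, m1 / x, m2 / (n - x))
    return best

data = [int(c) for c in "001010000000010111101111100010"]
log_lik, x_hat, p1_hat, p2_hat = best_change_point(data)
print(x_hat, round(p1_hat, 3), round(p2_hat, 3), round(log_lik, 3))
```

Maximizing this log-likelihood over x is the same as minimizing the total entropy x*H(p1) + (N-x)*H(p2) of the two segments.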

Example D: detecting weak common sequence patterns in a set of related sequences

e.g., local sequence motifs for functionally or structurally related proteins (no overall sequence similarity)
regulatory elements in the upstream regions of co-regulated genes, which could be genes clustered together by microarray data
the simplest situation: each sequence contains one realization of the motif with given length, but the starting positions are unknown

YAR071W:600:-600\
catcaagatgagaaaataaagggattttttcgttcttttatc
attttctctttctcacttccgactacttcttatatctactttcat
cgtttcattcatcgtgggtgtctaataaagttttaatgacaga
gataaccttgataagctttttcttatacgctgtgtcacgtattt
attaaattaccacgttttcgcataacattctgtagttcatgt
gtactaaaaaaaaaaaaaaaaaa
gaaataggaaggaaagagtaaaaagttaatagaaaacaga
acacatccctaaacgaagccgcacaatcttggcgttcacac
gtgggtttaaaaaggcaaattacacagaa
tttcagaccctgtttaccggagagattccatattccgcacgtc
acattgccaaattggtcatctcaccagatatgttatacccg
ttttggaatgagcataaacagcgtcgaa
ttgccaagtaaaacgtatataagctcttacatttcgatagat
tcaagctcagtttcgccttggttgtaaagtaggaagaagaag
aagaagaagaggaacaacaacagcaaa
gagagcaagaacatcatcagaaatacca\
YBR092C:600:-600\
aatcaatgacttctacgactatgctgaaaagagagta
gccggtactgacttcctaaaggtctgtaacgtcagcagcgt
cagtaactctactgaattgaccttctactgggactg
gaacactactcattacaacgccagtctattgagacaatagt
tttgtataactaaataatattggaaactaaatacgaataccc
aaattttttatctaaattttgccgaaagattaaaatctgcagag
atatccgaaacaggtaaatggatgtttcaatccctgtagtc
agtcaggaacccatattatattacagtattagtcgccgctta
ggcacgcctttaattagcaaa
atcaaaccttaagtgcatatgccgtataagggaaactcaaa
gaactggcatcgcaaaaatgaaaaaaaggaagagtgaaaa
aaaaaaaattcaaaagaaatttactaaataataccagtttggg
aaatagtaaacagctttgagtagtcctatgcaacatatata
agtgcttaaatttgctggatggaagtcaattatgccttgatta
tcataaaaaaaatactacagtaaagaaagggccattccaaa
ttacct\
YBR093C:600:-600\
cgctaatagcggcgtgtcgcacgctctctttacaggac
gccggagaccggcattacaaggatccgaaagttgtattcaaca
agaatgcgcaaatatgtcaacgtatttggaagtcatcttatg
tgcgctgctttaatgttttctcatgtaagcggacgtcgtcta
taaacttcaaacgaaggtaaaaggttcatagcgctttttcttt
gtctgcacaaagaaatata
tattaaattagcacgttttcgcatagaacgcaactgcacaatg
ccaaaaaaagtaaaagtgattaaaagagttaattgaatagg
caatctctaaatgaatcgatacaaccttggcactcacacgt
gggactagcacagactaaatttatgattctggtccctgtttt
cgaagagatcgcacatgccaaattatcaaattggtcacctta
cttggcaaggcatataccc
atttgggataagggtaaacatctttgaattgtcgaaatgaaac
gtatataagcgctgatgttttgctaagtcgaggttagtatgg
cttcatctctcatgagaataagaacaa
caacaaatagagcaagcaaattcgagattacca\
YBR296C:600:-600\
gaaatctcggtttcacccgcaaaaaagtttaaatttcacaga
tcgcgccacaccgatcacaaaacggcttcaccacaagggtg
tgtggctgtgcgatagaccttttttttctttttctgctttttcgt
catccccacgttgtgccattaatttgttagtgggcccttaaatg
tcgaaatattgctaaaaattggcccgagtcattgaaaggcttt
aagaatataccgtac
aaaggagtttatgtaatcttaataaattgcatatgacaatg
cagcacgtgggagacaaatagtaataatactaatctatca
atactagatgtcacagccactttggatccttctattatgtaaa
tcattagattaactcagtcaatagcagattttttttacaatgt
ctactgggtggacatctccaaacaattcatgtcactaagc
ccggttttcgatatgaagaaaattatat
ataaacctgctgaagatgatctttacattgaggttattttaca
tgaattgtcatagaatgagtgacatagatcaaaggtgagaa
tactggagcgtatctaatcgaatcaatataa
acaaagattaagcaaaaatg\

Example: 22 genes identified as Pho4 targets by microarray, O'Shea lab

A model for the motif

AAATGA
AGGTCC
AGGATG
AGACGT

alignment matrix
     1     2     3     4     5     6
A    4     1     2     1     0     1
C    0     0     0     1     1     1
G    0     3     2     0     2     1
T    0     0     0     2     1     1

position specific probability matrix $f_{i,\sigma}$
     1     2     3     4     5     6
A    1.00  0.25  0.50  0.25  0.00  0.25
C    0.00  0.00  0.00  0.25  0.25  0.25
G    0.00  0.75  0.50  0.00  0.50  0.25
T    0.00  0.00  0.00  0.50  0.25  0.25
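
Both matrices follow mechanically from the four aligned sites. The short sketch below (helper names are hypothetical) counts bases column by column and then normalizes each column by the number of sites.

```python
BASES = "ACGT"

def alignment_counts(sites):
    """Count each base at each column of equally long aligned sites."""
    width = len(sites[0])
    counts = {base: [0] * width for base in BASES}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts

def position_probabilities(counts):
    """Divide every column of the count matrix by its column total."""
    width = len(counts["A"])
    totals = [sum(counts[base][j] for base in BASES) for j in range(width)]
    return {
        base: [counts[base][j] / totals[j] for j in range(width)]
        for base in BASES
    }

sites = ["AAATGA", "AGGTCC", "AGGATG", "AGACGT"]
counts = alignment_counts(sites)
probabilities = position_probabilities(counts)
for base in BASES:
    print(base, counts[base], probabilities[base])
```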

Model:
probability of observing certain base inside the motif is given by the above matrix
probability of observing certain base outside the motif is given by the background frequency $f_{0,\sigma}$

$\vec{x} = (x_1, x_2, \ldots, x_N)$: starting positions of the motif, unknown
Position specific probability matrix unknown
need to be inferred from the observed sequence data
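
$$P(\mathrm{seq}, \vec{x} \mid f_{i,\sigma}, f_{0,\sigma}) = \prod_{i=1}^{N} \left[ \prod_{j=1}^{x_i - 1} f_{0,\sigma_{ij}} \prod_{j=x_i}^{x_i + w - 1} f_{j - x_i + 1,\, \sigma_{ij}} \prod_{j=x_i + w}^{L} f_{0,\sigma_{ij}} \right]$$

$N$: number of sequences
$L$: length of the sequence
$w$: width of the motif
$\sigma_{ij}$: base of sequence i at position j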
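
$$P(\mathrm{seq}, \vec{x} \mid f_{i,\sigma}, f_{0,\sigma}) = \mathrm{const} \times \prod_{j=1}^{w} \prod_{\sigma} \left( \frac{f_{j,\sigma}}{f_{0,\sigma}} \right)^{n_{j,\sigma}(\vec{x})}$$
(likelihood ratio)

$n_{j,\sigma}(\vec{x})$: total number of counts for base $\sigma$ at position j in the alignment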
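
Maximizing w.r.t. $f_{i,\sigma}$ with $\vec{x}$ fixed

$$\hat{f}_{j,\sigma} = \frac{n_{j,\sigma}(\vec{x})}{\sum_{\sigma'} n_{j,\sigma'}(\vec{x})}$$

$$\log P(\mathrm{seq}, \vec{x} \mid \hat{f}_{i,\sigma}) = N \sum_{j=1}^{w} \sum_{\sigma} \hat{f}_{j,\sigma} \log \frac{\hat{f}_{j,\sigma}}{f_{0,\sigma}} + \mathrm{const}$$
log likelihood ratio = relative entropy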

Then maximize the above relative entropy $P(\mathrm{seq}, \vec{x} \mid \hat{f}_{i,\sigma})$ w.r.t. $\vec{x}$ (the alignment path).
in reality, this formula is modified by adding pseudocounts due to Bayesian estimate
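
For a given choice of starting positions, the relative entropy score can be evaluated as in the sketch below. Everything here is illustrative: the three short sequences, the uniform background, and the particular pseudocount scheme (adding `pseudocount * background[base]` to each count) are assumptions, since the slide does not specify how the pseudocounts enter.

```python
import math

BASES = "ACGT"

def motif_counts(seqs, starts, width):
    """n_{j,sigma}(x): base counts per motif column for 0-based start positions."""
    counts = [{base: 0 for base in BASES} for _ in range(width)]
    for seq, x in zip(seqs, starts):
        for j in range(width):
            counts[j][seq[x + j]] += 1
    return counts

def relative_entropy_score(counts, background, pseudocount=0.0):
    """N * sum_j sum_sigma f_hat * log(f_hat / f_0), with optional pseudocounts."""
    n_seqs = sum(counts[0].values())
    score = 0.0
    for column in counts:
        for base in BASES:
            # Assumed scheme: pseudocounts are distributed like the background.
            f_hat = (column[base] + pseudocount * background[base]) / (n_seqs + pseudocount)
            if f_hat > 0:
                score += f_hat * math.log(f_hat / background[base])
    return n_seqs * score

background = {base: 0.25 for base in BASES}      # hypothetical uniform background
seqs = ["TTACGTAA", "ACGTTTGG", "GGGACGTC"]      # hypothetical sequences with ACGT planted
aligned = relative_entropy_score(motif_counts(seqs, [2, 0, 3], 4), background)
shifted = relative_entropy_score(motif_counts(seqs, [0, 0, 0], 4), background)
smoothed = relative_entropy_score(motif_counts(seqs, [2, 0, 3], 4), background, 1.0)
print(aligned, shifted, smoothed)
```

The correctly aligned starting positions give the higher score, and the pseudocount pulls the estimated frequencies toward the background, which lowers the score of a perfect alignment slightly.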

Stormo-Hartzell Algorithm: Consensus

each of the length-w substrings of the first sequence is aligned against all the substrings of the same length in the second sequence, matrices are derived, and the N top matrices with highest information contents are saved
the next sequence on the list is added to the analysis, all the matrices saved previously are paired with the substrings of the added sequence, and the top N matrices are saved
repeat the previous step until all the sequences have been processed
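
The three steps above can be sketched as a small greedy search. This is a simplified illustration and not the actual Consensus program: the scoring is the plain relative entropy of the column frequencies against an assumed uniform background, there are no pseudocounts or p-values, and the toy sequences (with CACGTG planted in each) are made up.

```python
import math

BASES = "ACGT"

def information_content(sites, background):
    """Sum over columns of sum_sigma f * log(f / f_0) for a set of aligned sites."""
    n_sites = len(sites)
    info = 0.0
    for j in range(len(sites[0])):
        for base in BASES:
            f = sum(1 for site in sites if site[j] == base) / n_sites
            if f > 0:
                info += f * math.log(f / background[base])
    return info

def substrings(seq, width):
    """All length-width substrings of seq, left to right."""
    return [seq[k:k + width] for k in range(len(seq) - width + 1)]

def greedy_consensus(seqs, width, keep, background):
    """Add one sequence at a time, keeping the top `keep` alignments each round."""
    candidates = [[site] for site in substrings(seqs[0], width)]
    for seq in seqs[1:]:
        extended = [
            sites + [site]
            for sites in candidates
            for site in substrings(seq, width)
        ]
        extended.sort(
            key=lambda sites: information_content(sites, background),
            reverse=True,
        )
        candidates = extended[:keep]
    return candidates

background = {base: 0.25 for base in BASES}  # hypothetical uniform background
toy_seqs = ["GGTACACGTGTT", "CACGTGAATCGA", "TTGCACGTGCAA"]
top = greedy_consensus(toy_seqs, width=6, keep=5, background=background)
print(top[0], round(information_content(top[0], background), 4))
```

Because only the top alignments survive each round, the search is not guaranteed to find the globally best alignment; keeping more matrices per round trades time for a better chance of retaining it.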

MATRIX 1
number of sequences = 22
information = 8.80903
ln(p-value) = -153.757   p-value = 1.67566E-67
ln(expected frequency) = -13.357   expected frequency = 1.58165E-06

A |  6  5 20  3  0  3  0  0  0  6
G | 11  0  0  5 22  0 21 15 14  2
C |  4 17  0 14  0  0  1  2  8  1
T |  1  0  2  0  0 19  0  5  0 13
     G  C  A  C  G  T  G  G  G  T

 1|1  :  1/317  ACACGTGGGT
 2|2  :  2/55   AAAGGTCTGT
 3|3  :  3/347  ACACGTGGGA
 4|4  :  4/274  GCACGTGGGA
 5|5  :  5/392  CAACGTGTCT
 6|6  :  6/395  ACAAGTGGGT
 7|7  :  7/321  ACACGTGGGA
 8|8  :  8/536  GCAAGTGGCT
 9|9  :  9/177  GCTGGTGTGT
10|10 : 10/443  GCACGTGTCT
11|11 : 11/14   CCAGGTGCCT
12|12 : 12/502  GAAAGAGGCA
13|13 : 13/354  GCACGAGGGA
14|14 : 14/257  GCACGTGCGA
15|15 : 15/358  TCACGTGTGT
16|16 : 16/316  ACACGTGGGT
17|17 : 17/479  GCACGTGGCT
18|18 : 18/227  GATGGTGGCT
19|19 : 19/186  GCACGTGGGG
20|20 : 20/326  GAAGGAGGGG
21|21 : 21/307  CCACGTGGGC
22|22 : 22/255  CCACGTGGCT

Consensus output for Pho4 regulated genes

Maximum likelihood estimate with missing data
General formulation
Expectation and Maximization (EM) algorithm

Missing data:
in example C, the point where the coin is changed
in example D, the starting positions of the motif

in the maximum likelihood approach, there is a crucial distinction between parameters (population), such as the position specific probability matrix, and the missing data, since the missing data grow with the sample size and in general cannot be recovered precisely even if the sample size goes to infinity

For many problems, it is necessary to sum over all missing data

$$L(x; \theta) = \sum_{y} P(x, y \mid \theta)$$

Where $x$ is the observed data and $y$ is the missing data

To estimate the parameters, one maximizes the likelihood function $L(x; \theta)$; however, it is often difficult to perform the summation over missing data explicitly

Expectation Maximization (EM) algorithm

Improve the estimate of the parameters iteratively
Given an estimate $\theta^{t}$, find $\theta^{t+1}$ that increases the likelihood function

E step: calculate the Q function, the expectation of $\log P(x, y \mid \theta)$ over missing data with prob. given by the current parameter $\theta^{t}$
$$Q(\theta \mid \theta^{t}) \equiv \sum_{y} P(y \mid x, \theta^{t}) \log P(x, y \mid \theta)$$

M step: maximize the Q function to get a new estimate
$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^{t})$$
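
That the EM algorithm always increases the likelihood function can be proved by the following equation and inequality

$$\log P(x \mid \theta) - \log P(x \mid \theta^{t}) = Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t}) + \sum_{y} P(y \mid x, \theta^{t}) \log \frac{P(y \mid x, \theta^{t})}{P(y \mid x, \theta)}$$

$$\log P(x \mid \theta) - \log P(x \mid \theta^{t}) \geq Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t})$$

The inequality holds because the last sum in the equation is a relative entropy, which is never negative.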
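
The monotone increase can be watched on a toy problem. The sketch below is not one of the lecture's examples: it assumes two coins with unknown heads probabilities, each set of n tosses made with one coin chosen with equal probability, and the identity of the coin used for each set is the missing data. The head counts and the starting values are made up.

```python
import math

def set_log_prob(heads, n, p):
    """log of p^heads (1-p)^(n-heads); the binomial coefficient is omitted."""
    return heads * math.log(p) + (n - heads) * math.log(1 - p)

def log_likelihood(head_counts, n, pa, pb):
    """log P(x | theta): the missing coin labels are summed out for each set."""
    total = 0.0
    for h in head_counts:
        total += math.log(
            0.5 * math.exp(set_log_prob(h, n, pa))
            + 0.5 * math.exp(set_log_prob(h, n, pb))
        )
    return total

def em_step(head_counts, n, pa, pb):
    """One E step (posterior of the coin label) followed by one M step."""
    heads_a = tosses_a = heads_b = tosses_b = 0.0
    for h in head_counts:
        wa = math.exp(set_log_prob(h, n, pa))
        wb = math.exp(set_log_prob(h, n, pb))
        ra = wa / (wa + wb)  # P(label = A | x, theta_t)
        heads_a += ra * h
        tosses_a += ra * n
        heads_b += (1 - ra) * h
        tosses_b += (1 - ra) * n
    return heads_a / tosses_a, heads_b / tosses_b

head_counts = [5, 9, 8, 4, 7]  # hypothetical heads per set of n = 10 tosses
n = 10
pa, pb = 0.6, 0.5              # arbitrary starting estimate theta_0
history = [log_likelihood(head_counts, n, pa, pb)]
for _ in range(20):
    pa, pb = em_step(head_counts, n, pa, pb)
    history.append(log_likelihood(head_counts, n, pa, pb))
print(round(pa, 3), round(pb, 3), round(history[0], 4), round(history[-1], 4))
```

The recorded log-likelihood never decreases from one iteration to the next, as the inequality predicts, although EM may stop at a local maximum that depends on the starting estimate.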

[Figure: a set of regulatory sequences containing motif1, motif2 and motif3]
How do we find these motifs?

Example E: identifying combinatorial motifs by word segmentation

chapterptgpbqdrftezptqtasctmvivwpecjsnisrmbtqlmlfvetl
loomingsfkicallxjgkmekysjerishmaeljplfsomeylqyearstvh
njbagoaxhjtjcokhvneverpmqpmindhowzrbdlzjllonggbhqi
preciselysunpvskepfdjktcgarwtnxybgcvdjfbnohavinglittl
ezorunozsoyapmoneyyvugsgtsqintmyteixpurseiwfmjwgj
nyyveqxwftlamnbxkrsbkyandrnothingcgparticularwtzao
qsjtnmtoqsnwvxfiupinterestztimebymonlnshoreggditho
ughtyxfxmhqixceojjzdhwouldsailpcaboutudxsbsnewtpg
gvjaasxmsvlittleplvcydaowgwlbzizjlnzyxandzolwcudthjd
osbopxkkfdosxardgcseebbthefzrsskdhmawateryjikzicim
ypartmofprtheluworldvtoamfutitazpisagwewayrqbkiosh
avebojwphiixofprmalungipjdrivingpkuyoikrwxoffodhicb
nimtheixyucpdzacemspleenqbpcrmhwvddyaiwnandada
bkpgzmptoregulatingeetheslcirculationvsuctzwvfyxstuzr
dfwvgygzoejdfmbqescwheneverpitfindfmyselfcgrowingne
ostumrydrrthmjsmgrimcczhjmgbkwczoaboutjbwanbwzq
thehrjvdrcjjgmouthuutwheneveritddfouishlawwphxnae

Bussemaker/Li/Siggia Model: Probabilistic Segmentation/Maximum likelihood

A probabilistic dictionary
Words     probabilities
A         $P_A$
C         $P_C$
G         $P_G$
T         $P_T$
GC        $P_{GC}$
TATAA     $P_{TATAA}$

A|G|T|A|T|A|A|G|C
A|G|TATAA|GC
A|G|TATAA|G|C

$$Z = \sum_{\mathrm{Seg}} P_{w_1} P_{w_2} P_{w_3} \cdots P_{w_n}$$

maximizing the likelihood function
word boundary is missing
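
The sum over segmentations does not have to be enumerated explicitly; it can be accumulated left to right. In the sketch below the word probabilities are made-up numbers (the slide gives only the symbols), and a brute-force enumeration is included solely to check the dynamic-programming result on the slide's sequence AGTATAAGC.

```python
import math

def partition_sum(seq, word_prob):
    """Z = sum over all segmentations of the product of word probabilities."""
    max_len = max(len(word) for word in word_prob)
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(seq) + 1):
        for length in range(1, min(max_len, end) + 1):
            word = seq[end - length:end]
            if word in word_prob:
                z[end] += z[end - length] * word_prob[word]
    return z[len(seq)]

def enumerate_segmentations(seq, word_prob):
    """Brute-force list of all ways to split seq into dictionary words."""
    if not seq:
        return [[]]
    result = []
    for word in word_prob:
        if seq.startswith(word):
            for rest in enumerate_segmentations(seq[len(word):], word_prob):
                result.append([word] + rest)
    return result

# Hypothetical probabilities; only the word list comes from the slide.
word_prob = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "GC": 0.1, "TATAA": 0.1}
segmentations = enumerate_segmentations("AGTATAAGC", word_prob)
brute_force = sum(
    math.prod(word_prob[word] for word in seg) for seg in segmentations
)
z_value = partition_sum("AGTATAAGC", word_prob)
print(len(segmentations), z_value, brute_force)
```

The enumeration grows exponentially with sequence length, while the recursion takes time proportional to the sequence length times the longest word length.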

Dictionary Construction

Parameter inference: given the entries in the dictionary, find $P_W$ by maximizing the likelihood function. Starting with a simple dictionary with all possible words
Model improvement: do statistical test on longer words based on the current dictionary, add the ones that are over-represented
re-assign $P_W$ by maximizing the likelihood function
Iterate the above
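
EM algorithm for the word segmentation

$$L(\{p_w\}; \mathrm{seq}) = \sum_{\mathrm{Seg}} \prod_{w} (p_w)^{N_w(\mathrm{Seg})}$$
$N_w(\mathrm{Seg})$: number of word w in a given segmentation

E step:
$$Q(\{p_w\} \mid \{p_w(t)\}) = \sum_{w} \langle N_w \rangle_t \log p_w$$

M step:
$$p_w(t+1) = \frac{\langle N_w \rangle_t}{\sum_{w'} \langle N_{w'} \rangle_t}$$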
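
One way to carry out these two steps is a forward-backward pass over the sequence, which yields the expected word counts without enumerating segmentations. The sketch below is an illustration under stated assumptions: a fixed six-word dictionary, uniform starting probabilities, a short made-up sequence, and no rescaling of the forward and backward sums (so it would underflow on long sequences).

```python
def expected_counts(seq, word_prob):
    """Return (<N_w> for every word, Z) under the current word probabilities."""
    n = len(seq)
    max_len = max(len(word) for word in word_prob)
    forward = [0.0] * (n + 1)   # forward[k]: summed weight of segmentations of seq[:k]
    forward[0] = 1.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            word = seq[end - length:end]
            if word in word_prob:
                forward[end] += forward[end - length] * word_prob[word]
    backward = [0.0] * (n + 1)  # backward[k]: summed weight of segmentations of seq[k:]
    backward[n] = 1.0
    for start in range(n - 1, -1, -1):
        for length in range(1, min(max_len, n - start) + 1):
            word = seq[start:start + length]
            if word in word_prob:
                backward[start] += word_prob[word] * backward[start + length]
    z = forward[n]
    counts = {word: 0.0 for word in word_prob}
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            word = seq[start:start + length]
            if word in word_prob:
                # Posterior probability that `word` occupies seq[start:start+length].
                counts[word] += (
                    forward[start] * word_prob[word] * backward[start + length] / z
                )
    return counts, z

def em_update(seq, word_prob):
    """E step (expected counts) then M step (renormalize the counts)."""
    counts, z = expected_counts(seq, word_prob)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}, z

words = ["A", "C", "G", "T", "GC", "TATAA"]
word_prob = {word: 1.0 / len(words) for word in words}  # uniform start (assumption)
sequence = "AGTATAAGCTATAAGC"                           # hypothetical sequence
z_history = []
for _ in range(15):
    word_prob, z = em_update(sequence, word_prob)
    z_history.append(z)  # likelihood before this update
print({word: round(p, 3) for word, p in word_prob.items()})
```

Each entry of `z_history` is the likelihood under the parameters used in that iteration's E step, so the list should be non-decreasing.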

Dictionary1        Dictionary2        Dictionary3
-----------------------------------------------------------------------------
e  0.065239        e   0.048730       e     0.042774
t  0.055658        s   0.042589       s     0.040843
a  0.052555        a   0.040539       a     0.038595
o  0.050341        t   0.040442       i     0.036897
n  0.049266        i   0.038550       t     0.036871
i  0.048101        d   0.038547       d     0.036323
s  0.047616        o   0.036486       l     0.035336
h  0.047166        l   0.036300       c     0.034818
r  0.043287        g   0.034509       m     0.034650
l  0.041274        r   0.034496       y     0.034482
d  0.039461        c   0.033916       b     0.034396
u  0.034742        m   0.033724       r     0.034105
m  0.034349        n   0.033321       p     0.034044
g  0.034001        y   0.033227       w     0.033819
w  0.033967        p   0.033156       n     0.033817
c  0.032934        f   0.032863       g     0.033676
f  0.032597        b   0.032780       f     0.033534
y  0.031776        w   0.032009       o     0.033206
p  0.031711        h   0.031494       h     0.033200
b  0.031409        v   0.030727       k     0.032103
v  0.028268        k   0.030445       v     0.031498
k  0.028113        u   0.030379       j     0.031209
j  0.026712        j   0.029268       u     0.031186
q  0.026561        z   0.028905       z     0.031003
z  0.026542        x   0.028404       x     0.030544
x  0.026357        q   0.028123       q     0.030244
                   th  0.009954       the   0.005715
                   in  0.006408       ing   0.003237
                   er  0.004755       and   0.003128
                   an  0.004352       in    0.002968
                   ou  0.003225       ed    0.002547
                   on  0.003180       to    0.002496
                   he  0.003108       of    0.002486
                   at  0.002851       en    0.001331
                   ed  0.002804       an    0.001313
                   or  0.002786       th    0.001270
                   en  0.002538       er    0.001250
                   to  0.002511       es    0.001209
                   of  0.002475       at    0.001181
                   st  0.002415       it    0.001171
                   nd  0.002297       that  0.001165

Words        <Nw>    quality factor
--------------------------------------------------------------------------------
abominate 2.0000 1.0000
achieved 2.0000 1.0000
aemploy 2.0000 1.0000
affrighted 2.0000 1.0000
afternoon 2.0000 1.0000
afterwards 5.0000 1.0000
ahollow 2.0000 1.0000
american 3.0000 1.0000
anxious 2.0000 1.0000
apartment 2.0000 1.0000
appeared 4.0000 1.0000
astonishment 4.0000 1.0000
attention 2.0000 1.0000
avenues 2.0000 1.0000
bashful 2.0000 1.0000
battery 2.0000 1.0000
beefsteaks 2.0000 1.0000
believe 2.0000 1.0000
beloved 2.0000 1.0000
beneath 6.0000 1.0000
between 12.0000 1.0000
boisterous 3.0000 1.0000
botherwise 2.0000 1.0000
bountiful 2.0000 1.0000
bowsprit 2.0000 1.0000
breakfast 5.0000 1.0000
breeding 2.0000 1.0000
bulkington 3.0000 1.0000
bulwarksb 2.0000 1.0000
bumpkin 2.0000 1.0000
business 6.0000 1.0000
carpenters 2.0000 1.0000

Table 1. Known cell cycle sites and some metabolic sites that match words from our genomewide dictionary

MCB    ACGCGT              AAACGCGT  ACGCGTCGCGT  CGCGACGCGT  TGACGCGT
SCB    CRCGAAA             ACGCGAAA
SCB'   ACRMSAAA            ACGCGAAA  ACGCCAAA  AACGCCAA
Swi5   RRCCAGCR            GCCAGCG  GCAGCCAG
SIC1   GCSCRGC             GCCCAGCC  CCGCGCGG
MCM1   TTWCCYAAWNNGGWAA    TTTCCNNNNNNGGAAA
NIT    GATAAT              TGATAATG
MET    TCACGTG             RTCACGTG  TCACGTGM  CACGTGAC  CACGTGCT
PDR    TCCGCGGA            TCCGCGG
HAP    CCAAY               AACCCAAC
MIG1   KANWWWWATSYGGGGW    TATATGTG  CATATATG  GTGGGGAG
GAL4   CGGN11CCG           CGGN11CCG

Our dictionary vs. known TF binding sites
Yeast promoter database: 443 non-redundant sites (Zhu and Zhang, Cold Spring Harbor)

                        # of matches    Expected (standard deviation)
Our dictionary          114             25 (4.8)
Scrambled dictionary    33              14 (3.3)
Brazma et al.           30              9 (2.9)