BP-203 Foundations for Mathematical Biology Statistics Lecture III By Hao Li Nov 8, 2001

Page 1: BP-203 Foundations for Mathematical Biology Statistics Lecture …mobydick.ucsf.edu/~haoli/biomath_3.pdf · 2005-08-23 · tries to minimize the total entropy A generalization of

BP-203 Foundations for Mathematical Biology

Statistics Lecture III

By Hao Li

Nov 8, 2001

Page 2:

Statistical Modeling and Inference

data collection

constructing probabilistic model

inference of model parameters

interpreting results

making new predictions

Page 3:

Maximum likelihood Approach

Example A: Toss a coin N times, observe m heads in a specific sequence

Model: binomial distribution

Inference: the parameter $p$

Prediction: e.g., how many heads will be observed for another L trials

Prob. of observing a specific sequence of m heads:

$$P(m \mid p) = p^{m} (1-p)^{N-m}$$

Find a $\hat{p}$ such that the above prob. is maximized:

$$\frac{\partial \log P(m \mid p)}{\partial p} = 0 \;\Rightarrow\; \hat{p} = \frac{m}{N}$$

$$\log P(m \mid \hat{p}) = N\left[\hat{p}\log\hat{p} + (1-\hat{p})\log(1-\hat{p})\right]$$

The bracketed term is $-$entropy.
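The relation between the maximized log-likelihood and the entropy can be checked numerically. The following is a minimal sketch; the counts (7 heads in 20 tosses) and the function names are illustrative assumptions, not values from the lecture.

```python
import math

def log_likelihood(m, n, p):
    """log P(m|p) = m*log(p) + (n-m)*log(1-p) for one specific sequence."""
    ll = 0.0
    if m > 0:
        ll += m * math.log(p)
    if n - m > 0:
        ll += (n - m) * math.log(1.0 - p)
    return ll

def entropy(p):
    """Entropy (in nats) of a coin with head probability p; 0*log(0) is taken as 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

n_tosses, n_heads = 20, 7      # hypothetical counts, not from the lecture
p_hat = n_heads / n_tosses     # maximum likelihood estimate m/N

print("p_hat =", p_hat)
print("log P(m|p_hat) =", log_likelihood(n_heads, n_tosses, p_hat))
print("-N * entropy   =", -n_tosses * entropy(p_hat))
```

The last two printed values should agree up to rounding, and any other value of p should give a lower log-likelihood than p_hat.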

Page 4:

How good is the estimate?

Distribution of $\hat{p}$ under repeated sampling

Central limit theorem: distribution of m approaches normal for large N

$$m \sim Np \pm \sqrt{Np(1-p)}$$

$$\hat{p} \sim p \pm \sqrt{p(1-p)/N}$$

Thus the estimate converges to the real p with a square-root convergence
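The square-root convergence can be illustrated without simulation by computing the spread of the estimate m/N exactly from the binomial distribution of m. This is a sketch under assumed values (p = 0.3 and three sample sizes chosen for illustration); here the binomial coefficient is included because m is summed over all sequences with that many heads.

```python
import math

def binomial_pmf(m, n, p):
    """Probability of observing m heads in n tosses (any order)."""
    return math.comb(n, m) * p**m * (1 - p) ** (n - m)

def std_of_p_hat(n, p):
    """Exact standard deviation of the estimate m/n under repeated sampling."""
    mean = sum(binomial_pmf(m, n, p) * (m / n) for m in range(n + 1))
    var = sum(binomial_pmf(m, n, p) * (m / n - mean) ** 2 for m in range(n + 1))
    return math.sqrt(var)

p_true = 0.3  # hypothetical true parameter
for n in (25, 100, 400):
    # Second column: exact spread; third column: sqrt(p(1-p)/N) from the slide.
    print(n, std_of_p_hat(n, p_true), math.sqrt(p_true * (1 - p_true) / n))
```

Quadrupling N should halve the spread of the estimate.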

Page 5:

Maximum likelihood Approach

Example B: $x_1, x_2, \ldots, x_N$: independent and identically distributed (i.i.d.) sample drawn from a normal distribution $N(\mu, \sigma^2)$

Estimate the mean and the variance

Maximizing the likelihood function (show this is true in the homework):

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2$$
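A small numerical check of these estimators, assuming a made-up four-point sample: the sketch computes the mean and the 1/N variance and compares the log-likelihood there with nearby parameter values. Function names are illustrative.

```python
import math

def normal_mle(xs):
    """Maximum likelihood estimates (mean, variance) for an i.i.d. normal sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # divides by N, not N-1
    return mu, var

def normal_log_likelihood(xs, mu, var):
    """Log-likelihood of the sample under N(mu, var)."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

sample = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
mu_hat, var_hat = normal_mle(sample)
print(mu_hat, var_hat)
print(normal_log_likelihood(sample, mu_hat, var_hat))
print(normal_log_likelihood(sample, mu_hat + 0.1, var_hat))  # slightly worse
```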

Page 6:

General formulation of the maximum likelihood approach

D: observed data

M: the statistical model

$\theta$: parameters of the model

$P(D \mid M, \theta)$: probability of observing the data given the model and parameters

$L(\theta; D) \equiv P(D \mid M, \theta)$: the likelihood of $\theta$ as a function of data

Maximum likelihood estimate of the parameters:

$$\hat{\theta} = \arg\max_{\theta} L(\theta; D)$$

Theorem: $\hat{\theta}$ converges to the true $\theta_0$ in the large sample limit with error $\sim 1/\sqrt{N}$

Page 7:

Example C: Segmentation

a sequence of head (1) and tail (0) is generated by first using a coin with $p_1$ and then change to a coin with $p_2$; the change point unknown

Data = (001010000000010111101111100010)

$$P(seq, x \mid p_1, p_2) = p_1^{m_1(x)} (1-p_1)^{x - m_1(x)} \, p_2^{m_2(x)} (1-p_2)^{N - x - m_2(x)}$$

$x$: position right before the change

$m_1(x)$: number of 1's up to x

$m_2(x)$: number of 1's after x

$N$: total number of tosses

Page 8:

Example C continued

For fixed $x$, maximize $P(seq, x \mid p_1, p_2)$ with respect to $p_1$ and $p_2$:

$$\hat{p}_1 = m_1(x)/x, \qquad \hat{p}_2 = m_2(x)/(N-x)$$

$$\log P(seq, x \mid \hat{p}_1, \hat{p}_2) = x\left[\hat{p}_1\log\hat{p}_1 + (1-\hat{p}_1)\log(1-\hat{p}_1)\right] + (N-x)\left[\hat{p}_2\log\hat{p}_2 + (1-\hat{p}_2)\log(1-\hat{p}_2)\right]$$

Then maximize $P(seq, x \mid \hat{p}_1, \hat{p}_2)$ with respect to $x$

The above approach is sometimes referred to as "entropic segmentation", as it tries to minimize the total entropy

A generalization of the above model to a 4-letter alphabet and an unknown number of breaking points can be used to segment DNA sequences into regions of different composition. This is more naturally described by a hidden Markov model.
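The two-stage maximization (first over the two coin parameters for each fixed x, then over x) can be sketched as a single scan over candidate change points. This is an illustrative implementation, not the lecture's code; the function names are assumptions, and ties are broken in favour of the earliest x.

```python
import math

def xlogx(p):
    """p*log(p) with the convention 0*log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def neg_entropy(p):
    """p log p + (1-p) log(1-p), i.e. minus the entropy of a coin."""
    return xlogx(p) + xlogx(1.0 - p)

def best_change_point(bits):
    """Return (x, log-likelihood) maximizing log P(seq, x | p1_hat, p2_hat)."""
    n, total = len(bits), sum(bits)
    best_x, best_ll = None, -math.inf
    m1 = 0                                   # number of 1's up to position x
    for x in range(1, n):                    # both segments must be non-empty
        m1 += bits[x - 1]
        m2 = total - m1                      # number of 1's after position x
        ll = x * neg_entropy(m1 / x) + (n - x) * neg_entropy(m2 / (n - x))
        if ll > best_ll:
            best_x, best_ll = x, ll
    return best_x, best_ll

# The 30-toss data string from the Example C slide.
data = [int(c) for c in "001010000000010111101111100010"]
print(best_change_point(data))
```

The scan costs O(N) because the counts of 1's on each side are updated incrementally rather than recomputed for every x.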

Page 9:

Example D: detecting weak common sequence patterns in a set of related sequences

e.g., local sequence motifs for functionally or structurally related proteins (no overall sequence similarity)

regulatory elements in the upstream regions of co-regulated genes, could be genes clustered together by microarray data

the simplest situation: each sequence contains one realization of the motif with given length, but the starting positions are unknown

Page 10:

YAR071W:600:-600
\catcaagatgagaaaataaagggattttttcgttcttttatcattttctctttctcacttc
cgactacttcttatatctactttcatcgtttcattcatcgtgggtgtctaataaagtttta
atgacagagataaccttgataagctttttcttatacgctgtgtcacgtatttattaaatta
ccacgttttcgcataacattctgtagttcatgtgtactaaaaaaaaaaaaaaaaaa
gaaataggaaggaaagagtaaaaagttaatagaaaacagaacacatccctaaacga
agccgcacaatcttggcgttcacacgtgggtttaaaaaggcaaattacacagaa
tttcagaccctgtttaccggagagattccatattccgcacgtcacattgccaaattggt
catctcaccagatatgttatacccgttttggaatgagcataaacagcgtcgaa
ttgccaagtaaaacgtatataagctcttacatttcgatagattcaagctcagtttcgcc
ttggttgtaaagtaggaagaagaagaagaagaagaggaacaacaacagcaaa
gagagcaagaacatcatcagaaatacca\

YBR092C:600:-600
\aatcaatgacttctacgactatgctgaaaagagagtagccggtactgacttcctaaag
gtctgtaacgtcagcagcgtcagtaactctactgaattgaccttctactgggactg
gaacactactcattacaacgccagtctattgagacaatagttttgtataactaaataa
tattggaaactaaatacgaatacccaaattttttatctaaattttgccgaaagattaaaat
ctgcagagatatccgaaacaggtaaatggatgtttcaatccctgtagtcagtcagga
acccatattatattacagtattagtcgccgcttaggcacgcctttaattagcaaa
atcaaaccttaagtgcatatgccgtataagggaaactcaaagaactggcatcgcaaa
aatgaaaaaaaggaagagtgaaaaaaaaaaaattcaaaagaaatttactaaataatac
cagtttgggaaatagtaaacagctttgagtagtcctatgcaacatatataagtgctta
aatttgctggatggaagtcaattatgccttgattatcataaaaaaaatactacagtaaa
gaaagggccattccaaattacct\

YBR093C:600:-600
\cgctaatagcggcgtgtcgcacgctctctttacaggacgccggagaccggcattacaag
gatccgaaagttgtattcaacaagaatgcgcaaatatgtcaacgtatttggaagtcatc
ttatgtgcgctgctttaatgttttctcatgtaagcggacgtcgtctataaacttcaaac
gaaggtaaaaggttcatagcgctttttctttgtctgcacaaagaaatata
tattaaattagcacgttttcgcatagaacgcaactgcacaatgccaaaaaaagtaaaag
tgattaaaagagttaattgaataggcaatctctaaatgaatcgatacaaccttggcac
tcacacgtgggactagcacagactaaatttatgattctggtccctgttttcgaagaga
tcgcacatgccaaattatcaaattggtcaccttacttggcaaggcatataccc
atttgggataagggtaaacatctttgaattgtcgaaatgaaacgtatataagcgctgat
gttttgctaagtcgaggttagtatggcttcatctctcatgagaataagaacaa
caacaaatagagcaagcaaattcgagattacca\

YBR296C:600:-600
\gaaatctcggtttcacccgcaaaaaagtttaaatttcacagatcgcgccacaccgatc
acaaaacggcttcaccacaagggtgtgtggctgtgcgatagaccttttttttctttttct
gctttttcgtcatccccacgttgtgccattaatttgttagtgggcccttaaatgtcga
aatattgctaaaaattggcccgagtcattgaaaggctttaagaatataccgtac
aaaggagtttatgtaatcttaataaattgcatatgacaatgcagcacgtgggagaca
aatagtaataatactaatctatcaatactagatgtcacagccactttggatccttcta
ttatgtaaatcattagattaactcagtcaatagcagattttttttacaatgtctactggg
tggacatctccaaacaattcatgtcactaagcccggttttcgatatgaagaaaattatat
ataaacctgctgaagatgatctttacattgaggttattttacatgaattgtcatagaatg
agtgacatagatcaaaggtgagaatactggagcgtatctaatcgaatcaatataa
acaaagattaagcaaaaatg\

Example: 22 genes identified as pho4 target by microarray, O'Shea lab

Page 11:

A model for the motif

AAATGA
AGGTCC
AGGATG
AGACGT

alignment matrix

     1  2  3  4  5  6
A    4  1  2  1  0  1
C    0  0  0  1  1  1
G    0  3  2  0  2  1
T    0  0  0  2  1  1

Page 12:

position specific probability matrix $f_{i,\sigma}$

     1     2     3     4     5     6
A    1.00  0.25  0.50  0.25  0.00  0.25
C    0.00  0.00  0.00  0.25  0.25  0.25
G    0.00  0.75  0.50  0.00  0.50  0.25
T    0.00  0.00  0.00  0.50  0.25  0.25

Model:

probability of observing certain base inside the motif is given by the above matrix

probability of observing certain base outside the motif is given by the background frequency $f_{0,\sigma}$
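The two matrices follow mechanically from the four aligned example sites (AAATGA, AGGTCC, AGGATG, AGACGT). A minimal sketch, with function names chosen for illustration:

```python
def alignment_matrix(sites, alphabet="ACGT"):
    """Count how often each base occurs at each position of the aligned sites."""
    width = len(sites[0])
    return {b: [sum(s[j] == b for s in sites) for j in range(width)] for b in alphabet}

def probability_matrix(counts, n_sites):
    """Convert counts to position specific probabilities f[base][position]."""
    return {b: [c / n_sites for c in row] for b, row in counts.items()}

sites = ["AAATGA", "AGGTCC", "AGGATG", "AGACGT"]   # the four sites on the slide
counts = alignment_matrix(sites)
probs = probability_matrix(counts, len(sites))

for base in "ACGT":
    print(base, counts[base], probs[base])
```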

Page 13:

$\vec{x} = (x_1, x_2, \ldots, x_N)$: Starting positions of the motif, unknown

Position specific probability matrix $f_{i,\sigma}$ unknown

need to be inferred from the observed sequence data

$$P(seq, \vec{x} \mid f_{0,\sigma}, f_{i,\sigma}) = \prod_{i=1}^{N} \left[ \prod_{j=1}^{x_i - 1} f_{0,\sigma_{ij}} \prod_{j=x_i}^{x_i + w - 1} f_{j - x_i + 1,\,\sigma_{ij}} \prod_{j=x_i + w}^{L} f_{0,\sigma_{ij}} \right]$$

$N$: Number of sequences

$L$: Length of the sequence

$w$: Width of the motif

$\sigma_{ij}$: Base of sequence i at position j

$$P(seq, \vec{x} \mid f_{0,\sigma}, f_{i,\sigma}) = const \times \prod_{j=1}^{w} \prod_{\sigma} \left( \frac{f_{j,\sigma}}{f_{0,\sigma}} \right)^{n_{j,\sigma}(\vec{x})}$$

The product on the right is the likelihood ratio.

$n_{j,\sigma}(\vec{x})$: Total number of count for base $\sigma$ at position j in the alignment

Page 14:

Maximizing $P(seq, \vec{x} \mid f_{0,\sigma}, f_{i,\sigma})$ w.r.t. $f_{i,\sigma}$ with $\vec{x}$ fixed:

$$\hat{f}_{j,\sigma} = \frac{n_{j,\sigma}(\vec{x})}{\sum_{\sigma'} n_{j,\sigma'}(\vec{x})}$$

$$\log P(seq, \vec{x} \mid f_{0,\sigma}, \hat{f}_{i,\sigma}) = N \sum_{j=1}^{w} \sum_{\sigma} \hat{f}_{j,\sigma} \log \frac{\hat{f}_{j,\sigma}}{f_{0,\sigma}} + const$$

The double sum is the log likelihood ratio, which is a relative entropy.

Then maximize the above relative entropy w.r.t. $\vec{x}$ (alignment path).

In reality, this formula is modified by adding pseudocounts, due to the Bayesian estimate.
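For a given choice of starting positions, the score to be maximized is N times the relative entropy between the motif columns and the background. A minimal sketch follows, assuming a uniform background of 0.25 per base and no pseudocounts (both are simplifying assumptions made here); the function name is illustrative.

```python
import math
from collections import Counter

def relative_entropy_score(sites, background):
    """N * sum_j sum_sigma f_hat[j][sigma] * log(f_hat[j][sigma] / f0[sigma])."""
    n = len(sites)
    width = len(sites[0])
    score = 0.0
    for j in range(width):
        column = Counter(site[j] for site in sites)   # n_{j,sigma} for this column
        for base, count in column.items():
            f_hat = count / n                         # ML estimate of f_{j,sigma}
            score += f_hat * math.log(f_hat / background[base])
    return n * score

uniform = {b: 0.25 for b in "ACGT"}                   # assumed background
sites = ["AAATGA", "AGGTCC", "AGGATG", "AGACGT"]      # example sites from the earlier slide
print(relative_entropy_score(sites, uniform))
```

Bases that never occur in a column contribute nothing to the sum, which is why only observed bases are iterated. With pseudocounts the estimate of each frequency would change, so the numbers printed here are for the unmodified formula only.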

Page 15:

Stormo-Hartzell Algorithm: Consensus

each of the length w substrings of the first sequence are aligned against all the substrings of the same length in the second sequence, matrices derived, N top matrices with highest information contents are saved

the next sequence on the list is added to the analysis, all the matrices saved previously are paired with the substrings of the added sequence and top N matrices saved

repeat the previous step until all the sequences have been processed
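The greedy procedure described above can be sketched in a few lines. This is a simplified illustration and not the actual Consensus program: it scores alignments by information content against an assumed uniform background, ignores pseudocounts and p-values, and uses a parameter named `keep` for the number of saved matrices (the slide calls it N) to avoid confusion with the number of sequences. The toy sequences are made up.

```python
import math
from collections import Counter

def info_content(words, background):
    """Sum over columns of the relative entropy between column frequencies and background."""
    n = len(words)
    total = 0.0
    for j in range(len(words[0])):
        for base, count in Counter(word[j] for word in words).items():
            f = count / n
            total += f * math.log(f / background[base])
    return total

def substrings(seq, w):
    """All length-w substrings of seq, in order of starting position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def greedy_consensus(seqs, w, keep=10, background=None):
    """Add one sequence at a time, keeping the `keep` best partial alignments."""
    background = background or {b: 0.25 for b in "ACGT"}
    alignments = [[word] for word in substrings(seqs[0], w)]
    for seq in seqs[1:]:
        # Pair every saved alignment with every substring of the new sequence.
        candidates = [al + [word] for al in alignments for word in substrings(seq, w)]
        candidates.sort(key=lambda al: info_content(al, background), reverse=True)
        alignments = candidates[:keep]
    return alignments

toy = ["TTACGTGTT", "GGACGTGCC", "CAACGTGAT"]   # hypothetical sequences sharing ACGTG
print(greedy_consensus(toy, 5)[0])
```

Because only the top alignments survive each round, the search is a heuristic: the best final alignment can be missed if its early partial alignments score poorly.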

Page 16:

MATRIX 1
number of sequences = 22
information = 8.80903
ln(p-value) = -153.757   p-value = 1.67566E-67
ln(expected frequency) = -13.357   expected frequency = 1.58165E-06

A |  6  5 20  3  0  3  0  0  0  6
G | 11  0  0  5 22  0 21 15 14  2
C |  4 17  0 14  0  0  1  2  8  1
T |  1  0  2  0  0 19  0  5  0 13
     G  C  A  C  G  T  G  G  G  T

 1|1  :  1/317  ACACGTGGGT
 2|2  :  2/55   AAAGGTCTGT
 3|3  :  3/347  ACACGTGGGA
 4|4  :  4/274  GCACGTGGGA
 5|5  :  5/392  CAACGTGTCT
 6|6  :  6/395  ACAAGTGGGT
 7|7  :  7/321  ACACGTGGGA
 8|8  :  8/536  GCAAGTGGCT
 9|9  :  9/177  GCTGGTGTGT
10|10 : 10/443  GCACGTGTCT
11|11 : 11/14   CCAGGTGCCT
12|12 : 12/502  GAAAGAGGCA
13|13 : 13/354  GCACGAGGGA
14|14 : 14/257  GCACGTGCGA
15|15 : 15/358  TCACGTGTGT
16|16 : 16/316  ACACGTGGGT
17|17 : 17/479  GCACGTGGCT
18|18 : 18/227  GATGGTGGCT
19|19 : 19/186  GCACGTGGGG
20|20 : 20/326  GAAGGAGGGG
21|21 : 21/307  CCACGTGGGC
22|22 : 22/255  CCACGTGGCT

Consensus output for Pho4 regulated genes

Page 17:

Maximum likelihood estimate with missing data

General formulation

Expectation and Maximization (EM) algorithm

Missing data:

in example C, the point where the coin is changed

in example D, the starting positions of the motif

in the maximum likelihood approach, there is a crucial distinction between parameters (population), such as the position specific probability matrix, and the missing data, since missing data grow with the sample size and in general cannot be recovered precisely even if the sample size goes to infinity

For many problems, it is necessary to sum over all missing data

$$L(x; \theta) = \sum_{y} P(x, y \mid \theta)$$

Where $x$ is the observed data and $y$ is the missing data

Page 18:

To estimate the parameters, one maximizes the likelihood function $L(x; \theta)$

however, it is often difficult to perform the summation over missing data explicitly

Expectation Maximization (EM) algorithm

Improve the estimate of the parameters iteratively

Given an estimate $\theta^{t}$, find $\theta^{t+1}$ that increases the likelihood function

E step: calculate the Q function, the expectation of $\log P(x, y \mid \theta)$ over missing data with prob. given by the current parameter

$$Q(\theta \mid \theta^{t}) \equiv \sum_{y} P(y \mid x, \theta^{t}) \log P(x, y \mid \theta)$$

M step: maximize the Q function to get a new estimate

$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^{t})$$

Page 19:

That the EM algorithm never decreases the likelihood function can be proved by the following equation and inequality

$$\log P(x \mid \theta) - \log P(x \mid \theta^{t}) = Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t}) + \sum_{y} P(y \mid x, \theta^{t}) \log \frac{P(y \mid x, \theta^{t})}{P(y \mid x, \theta)}$$

$$\log P(x \mid \theta) - \log P(x \mid \theta^{t}) \geq Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t})$$

The inequality holds because the last sum in the equation is a relative entropy and therefore non-negative.
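The monotonic behaviour of EM can be observed on a toy problem. The sketch below is a hypothetical illustration, not an example from the lecture: each trial is assumed to come from one of two coins chosen with equal probability 1/2, the coin identity is the missing data, and the two head probabilities (called `a` and `b` here) are the parameters. All data values are made up.

```python
import math

def seq_prob(heads, n, p):
    """Probability of one specific sequence with `heads` heads in n tosses."""
    return p**heads * (1 - p) ** (n - heads)

def log_likelihood(data, a, b):
    """log P(x | theta), summing over the missing coin identity of each trial."""
    return sum(math.log(0.5 * seq_prob(h, n, a) + 0.5 * seq_prob(h, n, b)) for h, n in data)

def em_step(data, a, b):
    """One E step (posterior of coin identity) followed by one M step."""
    num_a = den_a = num_b = den_b = 0.0
    for h, n in data:
        pa = 0.5 * seq_prob(h, n, a)
        pb = 0.5 * seq_prob(h, n, b)
        r = pa / (pa + pb)            # E step: P(coin a | trial, current parameters)
        num_a += r * h
        den_a += r * n
        num_b += (1 - r) * h
        den_b += (1 - r) * n
    return num_a / den_a, num_b / den_b   # M step: weighted head frequencies

data = [(9, 10), (8, 10), (2, 10), (1, 10), (7, 10)]   # hypothetical (heads, tosses)
a, b = 0.6, 0.4                                        # arbitrary starting guess
history = [log_likelihood(data, a, b)]
for _ in range(20):
    a, b = em_step(data, a, b)
    history.append(log_likelihood(data, a, b))

print(a, b)
print(history[0], history[-1])
```

The recorded log-likelihood values should form a non-decreasing sequence, as the inequality above requires.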

Page 20:

Example E: identifying combinatorial motifs by word segmentation

[Figure: A set of Regulatory Sequences; labels: motif1, motif2, motif3]

How do we find these motifs?

Page 21:

chapterptgpbqdrftezptqtasctmvivwpecjsnisrmbtqlmlfvetl
loomingsfkicallxjgkmekysjerishmaeljplfsomeylqyearstvh
njbagoaxhjtjcokhvneverpmqpmindhowzrbdlzjllonggbhqi
preciselysunpvskepfdjktcgarwtnxybgcvdjfbnohavinglittl
ezorunozsoyapmoneyyvugsgtsqintmyteixpurseiwfmjwgj
nyyveqxwftlamnbxkrsbkyandrnothingcgparticularwtzao
qsjtnmtoqsnwvxfiupinterestztimebymonlnshoreggditho
ughtyxfxmhqixceojjzdhwouldsailpcaboutudxsbsnewtpg
gvjaasxmsvlittleplvcydaowgwlbzizjlnzyxandzolwcudthjd
osbopxkkfdosxardgcseebbthefzrsskdhmawateryjikzicim
ypartmofprtheluworldvtoamfutitazpisagwewayrqbkiosh
avebojwphiixofprmalungipjdrivingpkuyoikrwxoffodhicb
nimtheixyucpdzacemspleenqbpcrmhwvddyaiwnandada
bkpgzmptoregulatingeetheslcirculationvsuctzwvfyxstuzr
dfwvgygzoejdfmbqescwheneverpitfindfmyselfcgrowingne
ostumrydrrthmjsmgrimcczhjmgbkwczoaboutjbwanbwzq
thehrjvdrcjjgmouthuutwheneveritddfouishlawwphxnae

Page 22:

Bussemaker/Li/Siggia Model: Probabilistic Segmentation/Maximum likelihood

A probabilistic dictionary

Words    probabilities
A        $P_A$
C        $P_C$
G        $P_G$
T        $P_T$
GC       $P_{GC}$
TATAA    $P_{TATAA}$

A | G | T | A | T | A | A | G | C

A | G | T A T A A | G C

A | G | T A T A A | G | C

$$Z = \sum_{Seg} P_{w_1} P_{w_2} P_{w_3} \cdots P_{w_n}$$

maximizing the likelihood function

word boundary is missing
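The sum over all segmentations does not have to be enumerated explicitly; it can be accumulated from left to right by dynamic programming. A minimal sketch for the nine-letter example AGTATAAGC and the six dictionary words shown on the slide; the numerical word probabilities are made-up placeholders.

```python
def partition_function(seq, probs):
    """Z = sum over all segmentations of seq of the product of word probabilities."""
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0                                   # empty prefix has one (empty) segmentation
    for end in range(1, len(seq) + 1):
        for word, p in probs.items():
            start = end - len(word)
            if start >= 0 and seq[start:end] == word:
                z[end] += p * z[start]           # last word of the prefix is `word`
    return z[len(seq)]

# Hypothetical probabilities for the dictionary on the slide (they sum to 1).
dictionary = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.2, "GC": 0.06, "TATAA": 0.04}
print(partition_function("AGTATAAGC", dictionary))

# Setting every probability to 1 counts the segmentations instead.
print(partition_function("AGTATAAGC", {w: 1.0 for w in dictionary}))
```

With all weights set to 1 the function returns the number of possible segmentations, which is 4 for this sequence: TATAA and GC can each be either used as a word or spelled out letter by letter.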

Page 23:

Dictionary Construction

Parameter inference: given the entries in the dictionary, find $P_W$ by maximizing the likelihood function.

Starting with a simple dictionary with all possible words

Model improvement: do statistical test on longer words based on the current dictionary, add the ones that are over-represented

re-assign $P_W$ by maximizing the likelihood function

Iterate the above

Page 24:

EM algorithm for the word segmentation

$$L(\{p_w\}; seq) = \sum_{Seg} \prod_{w} (p_w)^{N_w(Seg)}$$

$N_w(Seg)$: Number of word w in a given segmentation

E step:

$$Q(\{p_w\} \mid \{p_w(t)\}) = \sum_{w} \langle N_w \rangle_t \log p_w$$

M step:

$$p_w(t+1) = \frac{\langle N_w \rangle_t}{\sum_{w'} \langle N_{w'} \rangle_t}$$

Here $\langle N_w \rangle_t$ is the average of $N_w(Seg)$ over segmentations, weighted with the current probabilities $p_w(t)$.
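The expected word counts needed in the E step can be obtained with a forward and a backward pass over the sequence, after which the M step is a simple normalization. The following sketch is an assumed implementation for a single short sequence; the function names and the uniform starting probabilities are illustrative choices.

```python
def expected_counts(seq, probs):
    """Return ({word: <N_w>}, Z) using forward/backward sums over segmentations."""
    n = len(seq)
    fwd = [0.0] * (n + 1)          # fwd[i]: total weight of segmentations of seq[:i]
    fwd[0] = 1.0
    for i in range(1, n + 1):
        for w, p in probs.items():
            if len(w) <= i and seq[i - len(w):i] == w:
                fwd[i] += p * fwd[i - len(w)]
    bwd = [0.0] * (n + 1)          # bwd[i]: total weight of segmentations of seq[i:]
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        for w, p in probs.items():
            if seq[i:i + len(w)] == w:
                bwd[i] += p * bwd[i + len(w)]
    z = fwd[n]
    counts = {w: 0.0 for w in probs}
    for i in range(n):
        for w, p in probs.items():
            if seq[i:i + len(w)] == w:
                # Weight of all segmentations that use word w starting at position i.
                counts[w] += fwd[i] * p * bwd[i + len(w)] / z
    return counts, z

def em_update(seq, probs):
    """One EM iteration: p_w(t+1) = <N_w>_t / sum over words of <N_w'>_t."""
    counts, _ = expected_counts(seq, probs)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

words = ("A", "C", "G", "T", "GC", "TATAA")
p = {w: 1.0 / len(words) for w in words}       # assumed uniform start
for _ in range(5):
    p = em_update("AGTATAAGC", p)
print(p)
```

Each pass costs time proportional to the sequence length times the dictionary size, so the explicit sum over segmentations is never formed.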

Page 25:

Dictionary1        Dictionary2        Dictionary3
-----------------------------------------------------------------------------
e    0.065239      e    0.048730      e    0.042774
t    0.055658      s    0.042589      s    0.040843
a    0.052555      a    0.040539      a    0.038595
o    0.050341      t    0.040442      i    0.036897
n    0.049266      i    0.038550      t    0.036871
i    0.048101      d    0.038547      d    0.036323
s    0.047616      o    0.036486      l    0.035336
h    0.047166      l    0.036300      c    0.034818
r    0.043287      g    0.034509      m    0.034650
l    0.041274      r    0.034496      y    0.034482
d    0.039461      c    0.033916      b    0.034396
u    0.034742      m    0.033724      r    0.034105
m    0.034349      n    0.033321      p    0.034044
g    0.034001      y    0.033227      w    0.033819
w    0.033967      p    0.033156      n    0.033817
c    0.032934      f    0.032863      g    0.033676
f    0.032597      b    0.032780      f    0.033534
y    0.031776      w    0.032009      o    0.033206
p    0.031711      h    0.031494      h    0.033200
b    0.031409      v    0.030727      k    0.032103
v    0.028268      k    0.030445      v    0.031498
k    0.028113      u    0.030379      j    0.031209
j    0.026712      j    0.029268      u    0.031186
q    0.026561      z    0.028905      z    0.031003
z    0.026542      x    0.028404      x    0.030544
x    0.026357      q    0.028123      q    0.030244
                   th   0.009954      the  0.005715
                   in   0.006408      ing  0.003237
                   er   0.004755      and  0.003128
                   an   0.004352      in   0.002968
                   ou   0.003225      ed   0.002547
                   on   0.003180      to   0.002496
                   he   0.003108      of   0.002486
                   at   0.002851      en   0.001331
                   ed   0.002804      an   0.001313
                   or   0.002786      th   0.001270
                   en   0.002538      er   0.001250
                   to   0.002511      es   0.001209
                   of   0.002475      at   0.001181
                   st   0.002415      it   0.001171
                   nd   0.002297      that 0.001165

Page 26:

Words   <Nw>   quality factor

--------------------------------------------------------------------------------

abominate 2.0000 1.0000

achieved 2.0000 1.0000

aemploy 2.0000 1.0000

affrighted 2.0000 1.0000

afternoon 2.0000 1.0000

afterwards 5.0000 1.0000

ahollow 2.0000 1.0000

american 3.0000 1.0000

anxious 2.0000 1.0000

apartment 2.0000 1.0000

appeared 4.0000 1.0000

astonishment 4.0000 1.0000

attention 2.0000 1.0000

avenues 2.0000 1.0000

bashful 2.0000 1.0000

battery 2.0000 1.0000

beefsteaks 2.0000 1.0000

believe 2.0000 1.0000

beloved 2.0000 1.0000

beneath 6.0000 1.0000

between 12.0000 1.0000

boisterous 3.0000 1.0000

botherwise 2.0000 1.0000

bountiful 2.0000 1.0000

bowsprit 2.0000 1.0000

breakfast 5.0000 1.0000

breeding 2.0000 1.0000

bulkington 3.0000 1.0000

bulwarksb 2.0000 1.0000

bumpkin 2.0000 1.0000

business 6.0000 1.0000

carpenters 2.0000 1.0000

Page 27:

Table 1. Known cell cycle sites and some metabolic sites that match words from our genomewide dictionary

MCB     ACGCGT              AAACGCGT  ACGCGTCGCGT  CGCGACGCGT  TGACGCGT
SCB     CRCGAAA             ACGCGAAA
SCB'    ACRMSAAA            ACGCGAAA  ACGCCAAA  AACGCCAA
Swi5    RRCCAGCR            GCCAGCG  GCAGCCAG
SIC1    GCSCRGC             GCCCAGCC  CCGCGCGG
MCM1    TTWCCYAAWNNGGWAA    TTTCCNNNNNNGGAAA
NIT     GATAAT              TGATAATG
MET     TCACGTG             RTCACGTG  TCACGTGM  CACGTGAC  CACGTGCT
PDR     TCCGCGGA            TCCGCGG
HAP     CCAAY               AACCCAAC
MIG1    KANWWWWATSYGGGGW    TATATGTG  CATATATG  GTGGGGAG
GAL4    CGGN11CCG           CGGN11CCG

Page 28:

our dictionary vs. known TF binding sites

Yeast promoter database: 443 non-redundant sites (Zhu and Zhang, Cold Spring Harbor)

                        # of matches    Expected (standard deviation)
Our dictionary          114             25 (4.8)
Scrambled dictionary    33              14 (3.3)
Brazma et al.           30              9 (2.9)

es