
Page 1:

Rutgers CS440, Fall 2003

Introduction to Statistical Learning

Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.

Page 2: Learning under uncertainty

• How to learn probabilistic models such as Bayesian networks, Markov models, HMMs, …?

• Examples:
– Class confusion example: how did we come up with the CPTs?

– Earthquake-burglary network structure?

– How do we learn HMMs for speech recognition?

– Kalman model (e.g., mass, friction) parameters?

– User models encoded as Bayesian networks for HCI?

Page 3: Hypotheses and Bayesian theory

• Problem:
– Two kinds of candy, lemon and chocolate
– Packed in five types of unmarked bags:
(100% C, 0% L) 10% of the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
– Task: open a bag, (unwrap a candy, observe it, …), then predict what the next one will be

• Formulation:
– H (hypothesis): h1 (100,0), h2 (75,25), h3 (50,50), h4 (25,75), or h5 (0,100)
– di (data): the i-th opened candy, L (lemon) or C (chocolate)
– Goal: predict di+1 after seeing D = { d0, d1, …, di }, i.e., compute P( di+1 | D )

Page 4: Bayesian learning

• Bayesian solution

Estimate probabilities of the hypotheses (candy bag types), then predict the data (candy type)

Hypothesis posterior ~ data likelihood × hypothesis prior:
P( hi | D ) ~ P( D | hi ) P( hi )

Prediction:
P( d | D ) = Σ_hi P( d | hi ) P( hi | D )

I.I.D. (independently, identically distributed) data points:
P( D | hi ) = P( d0 | hi ) × … × P( di | hi )

Page 5: Example

• P( hi ) = ?

– P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)

• P( di | hi ) = ?

– P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5

• P( C, C, C, C, C | h4 ) = ?

– P( C, C, C, C, C | h4 ) = 0.25^5 ≈ 0.001

• P( h5 | C, C, C, C, C ) = ?

– P( h5 | C, C, C, C, C ) ~ P( C, C, C, C, C | h5 ) P( h5 ) = 0^5 × 0.1 = 0

• P( lemon | C, C, C, C, C ) = ?
– P( lemon | h1 ) P( h1 | C,C,C,C,C ) + … + P( lemon | h5 ) P( h5 | C,C,C,C,C ) =
0×0.6244 + 0.25×0.2963 + 0.50×0.0780 + 0.75×0.0012 + 1×0 = 0.1140

• P( chocolate | C, C, C, C, … ) = ?
– P( chocolate | C, C, C, C, … ) → 1 as more chocolates are observed (see the sketch below)
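A minimal Python sketch (not part of the original slides) that reproduces the numbers above; the helper names `posterior` and `predict_lemon` are my own.

```python
# Candy-bag example: exact Bayesian posterior and prediction.
# Hypotheses h1..h5 with P(chocolate | h_i) and prior P(h_i) as on the slides.
p_choc = [1.00, 0.75, 0.50, 0.25, 0.00]   # P(C | h_i)
prior  = [0.10, 0.20, 0.40, 0.20, 0.10]   # P(h_i)

def posterior(data):
    """P(h_i | D) for a sequence like 'CCCCC'."""
    unnorm = []
    for pc, ph in zip(p_choc, prior):
        like = 1.0
        for d in data:
            like *= pc if d == 'C' else (1.0 - pc)
        unnorm.append(like * ph)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lemon(data):
    """P(lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    post = posterior(data)
    return sum((1.0 - pc) * p for pc, p in zip(p_choc, post))

print(posterior('CCCCC'))      # ~ [0.624, 0.296, 0.078, 0.001, 0.0]
print(predict_lemon('CCCCC'))  # ~ 0.114
```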

Page 6: Bayesian prediction properties

• True hypothesis eventually dominates

• Bayesian prediction is optimal (minimizes prediction error)

• Comes at a price: usually many hypotheses, intractable summation

Page 7: Approximations to Bayesian prediction

• MAP – Maximum a posteriori

P( d | D ) = P( d | h_MAP ), h_MAP = arg max_{hi} P( hi | D )

(easier to compute)

• Role of prior, P(hi): penalizes complex hypotheses

• ML – Maximum likelihood

P( d | D ) = P( d | h_ML ), h_ML = arg max_{hi} P( D | hi )
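A small sketch comparing the MAP and ML shortcuts with the full Bayesian average on the same candy data; this is not from the slides, and the helper names are mine.

```python
# MAP vs. ML hypothesis selection for the candy example.
p_choc = [1.00, 0.75, 0.50, 0.25, 0.00]   # P(C | h_i)
prior  = [0.10, 0.20, 0.40, 0.20, 0.10]   # P(h_i)

def likelihood(data, pc):
    out = 1.0
    for d in data:
        out *= pc if d == 'C' else (1.0 - pc)
    return out

def h_map(data):
    # arg max_i P(h_i | D), proportional to P(D | h_i) P(h_i)
    return max(range(5), key=lambda i: likelihood(data, p_choc[i]) * prior[i])

def h_ml(data):
    # arg max_i P(D | h_i): the prior is ignored
    return max(range(5), key=lambda i: likelihood(data, p_choc[i]))

data = 'CCCCC'
print(h_map(data), 1.0 - p_choc[h_map(data)])  # index 0 = h1 -> P(lemon | h_MAP) = 0.0
print(h_ml(data),  1.0 - p_choc[h_ml(data)])   # index 0 = h1 -> P(lemon | h_ML)  = 0.0
```

After five chocolates both shortcuts pick h1, so the approximate prediction P( lemon | D ) is 0, whereas the full Bayesian average on Page 5 gives 0.114.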

Page 8: Learning from complete data

• Learn parameters of Bayesian models from data
– e.g., learn the probabilities of C and L for a bag of candy whose proportions of C and L are unknown, by observing opened candy from that bag

• Candy problem parameters:
– θ_u – probability of C in bag u
– π_u – probability of bag u

Candy | Parameter
C     | θ_u
L     | 1 - θ_u

Bag | Parameter
1   | π_1
2   | π_2
3   | π_3
4   | π_4
5   | π_5

(Figure: candy bag u)

Page 9: ML learning from complete data

• ML approach: select model parameters to maximize likelihood of seen data

1. Need to assume a distribution model that determines how the samples (of candy) are distributed in a bag

2. Select parameters of the model that maximize the likelihood of the seen data

Model (binomial):

P(d_i \mid h_u) = \theta_u^{\,\mathbf{1}(d_i = C)}\,(1 - \theta_u)^{\,\mathbf{1}(d_i = L)}

Likelihood (i.i.d. samples):

P(D \mid h_u) = \prod_{i=1}^{N} P(d_i \mid h_u)

Log-likelihood:

L(D \mid h_u) = \log P(D \mid h_u) = \sum_{i=1}^{N} \log P(d_i \mid h_u) = \sum_{i=1}^{N} \left[ \mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) \right]

Page 10: Maximum likelihood learning (binomial distribution)

• How to find a solution to the above problem?

h^* = \arg\max_{h_u} P(D \mid h_u) = \arg\max_{h_u} L(D \mid h_u) = \arg\max_{h_u} \sum_{i=1}^{N} \log P(d_i \mid h_u)

\theta_u^* = \arg\max_{\theta_u} \sum_{i=1}^{N} \left[ \mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) \right]

Page 11: Maximum likelihood learning (cont'd)

• Take the first derivative of (log) likelihood and set it to zero

\frac{\partial L(D \mid h_u)}{\partial \theta_u} = \sum_{i=1}^{N} \left[ \frac{\mathbf{1}(d_i = C)}{\theta_u} - \frac{\mathbf{1}(d_i = L)}{1 - \theta_u} \right] = 0

\Rightarrow\quad \theta_u^* = \frac{\sum_{i=1}^{N} \mathbf{1}(d_i = C)}{\sum_{i=1}^{N} \mathbf{1}(d_i = C) + \sum_{i=1}^{N} \mathbf{1}(d_i = L)} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(d_i = C)

• Counting!

Sample | d=C | d=L
1      | 1   | 0
2      | 0   | 1
…      | …   | …
N      | 1   | 0
Total  | Σ_i 1(d_i = C) | Σ_i 1(d_i = L)
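A quick numerical check of the counting result, using an assumed candy sequence that is not from the slides:

```python
# ML estimate of theta_u = P(C in bag u) by counting, matching the closed form above.
data = ["C", "L", "C", "C", "L", "C"]    # hypothetical opened candies from one bag

n_c = sum(1 for d in data if d == "C")   # sum_i 1(d_i = C)
n_l = sum(1 for d in data if d == "L")   # sum_i 1(d_i = L)

theta_ml = n_c / (n_c + n_l)             # = n_c / N
print(theta_ml)                          # 0.666...
```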

Page 12: Naïve Bayes model

• One set of causes, multiple independent sources of evidence

(Figure: Naïve Bayes model, class node C with children E1, E2, …, EN)

• Example: C ∈ {spam, not spam}, Ei ∈ {token i present, token i absent}

P(E_1, E_2, \ldots, E_N, C) = \left[ \prod_{i=1}^{N} P(E_i \mid C) \right] P(C)

• Limiting assumption, often works well in practice

Page 13: Inference & decision in the NB model

• Inference

P(C \mid E_1, E_2, \ldots, E_N) \;\propto\; \left[ \prod_{i=1}^{N} P(E_i \mid C) \right] P(C)

\log P(C \mid E_1, E_2, \ldots, E_N) \;\sim\; \sum_{i=1}^{N} \log P(E_i \mid C) + \log P(C)

Hypothesis (class) score = evidence score + prior score

• Decision

P(C = \text{SPAM} \mid E_1, E_2, \ldots, E_N) \;\gtrless\; P(C = \text{NOT\_SPAM} \mid E_1, E_2, \ldots, E_N)

\log P(C = \text{SPAM} \mid E_1, \ldots, E_N) - \log P(C = \text{NOT\_SPAM} \mid E_1, \ldots, E_N) \;\gtrless\; 0

\sum_{i=1}^{N} \log \frac{P(E_i \mid C = \text{SPAM})}{P(E_i \mid C = \text{NOT\_SPAM})} + \log \frac{P(C = \text{SPAM})}{P(C = \text{NOT\_SPAM})} \;\gtrless\; 0 \qquad \text{(log odds ratio)}
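A small sketch of the log-odds decision rule; the token probabilities, priors, and message below are invented for illustration, not taken from the slides.

```python
import math

# Naive Bayes spam decision via the log odds ratio above.
p_token_spam = {"viagra": 0.8, "meeting": 0.1, "free": 0.6}   # P(E_i = present | SPAM)
p_token_ham  = {"viagra": 0.01, "meeting": 0.5, "free": 0.2}  # P(E_i = present | NOT_SPAM)
p_spam, p_ham = 0.4, 0.6                                      # class priors P(C)

def log_odds(present_tokens):
    score = math.log(p_spam) - math.log(p_ham)
    for tok in p_token_spam:
        ps, ph = p_token_spam[tok], p_token_ham[tok]
        if tok in present_tokens:
            score += math.log(ps) - math.log(ph)
        else:                                   # absent tokens also carry evidence
            score += math.log(1 - ps) - math.log(1 - ph)
    return score

msg = {"free", "viagra"}
print(log_odds(msg), "-> SPAM" if log_odds(msg) > 0 else "-> NOT_SPAM")
```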

Page 14: Learning in NB models

• Example:

Given a set of K email messages, each with tokens D = { d_j = (e_{1j}, …, e_{Nj}) }, e_{ij} ∈ {0,1} (token i in message j present/absent), and labels C = { c_j } (label of message j: SPAM or NOT_SPAM), find the best set of CPTs P(Ei | C) and P(C)

Assume: P(Ei | C = c) is binomial with parameter θ_{i,c}, and P(C) is binomial with parameter θ_C (2N+1 parameters in total):

P(E_i = e \mid C = c) = \theta_{i,c}^{\,e}\,(1 - \theta_{i,c})^{1-e}

P(C = c) = \theta_C^{\,c}\,(1 - \theta_C)^{1-c}

• ML learning: maximize likelihood of K messages, each one in one of the two classes

\max_{\theta_{1,0},\ldots,\theta_{N,1},\,\theta_C}\; \log P(D, C) = \sum_{j=1}^{K} \left[ \sum_{i=1}^{N} \left( e_{ij}\log\theta_{i,c_j} + (1 - e_{ij})\log(1 - \theta_{i,c_j}) \right) + c_j\log\theta_C + (1 - c_j)\log(1 - \theta_C) \right]

Setting the derivatives to zero again reduces to counting:

\theta_C = \frac{1}{K}\sum_{j=1}^{K} c_j \qquad\qquad \theta_{i,c} = \frac{1}{K_c}\sum_{j:\,c_j = c} e_{ij}, \quad K_c = \sum_{j=1}^{K} \mathbf{1}(c_j = c)
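A sketch of this counting-based ML fit on a tiny labeled set; the four messages and the vocabulary are made up for illustration.

```python
# ML estimation of Naive Bayes parameters by counting.
# Each message: (set of tokens present, label), with label 1 = SPAM, 0 = NOT_SPAM.
messages = [
    ({"free", "viagra"}, 1),
    ({"free", "meeting"}, 0),
    ({"viagra"}, 1),
    ({"meeting"}, 0),
]
vocab = ["free", "viagra", "meeting"]

K = len(messages)
theta_C = sum(c for _, c in messages) / K          # fraction of SPAM messages

theta = {}                                         # theta[(token, c)] = P(token present | class c)
for c in (0, 1):
    in_class = [tokens for tokens, label in messages if label == c]
    K_c = len(in_class)
    for tok in vocab:
        theta[(tok, c)] = sum(1 for tokens in in_class if tok in tokens) / K_c

print(theta_C)      # 0.5
print(theta)        # e.g. P(free | SPAM) = 0.5, P(viagra | SPAM) = 1.0, ...
```

Note the zero estimates this produces (e.g. P(viagra present | NOT_SPAM) = 0), which is exactly the issue the pseudo-counts on Page 16 address.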

Page 15: Learning of Bayesian network parameters

• Naïve Bayes learning can be extended to BNs! How?
• Model each CPT as a binomial/multinomial distribution. Maximize the likelihood of the data given the BN.

(Figure: the earthquake-burglary network; Earthquake and Burglary are parents of Alarm, Earthquake is a parent of Newscast, Alarm is a parent of Call)

P(E = e) = \theta_E^{\,e}\,(1 - \theta_E)^{1-e}

P(B = b) = \theta_B^{\,b}\,(1 - \theta_B)^{1-b}

P(A = a \mid B = b, E = e) = \theta_{A|be}^{\,a}\,(1 - \theta_{A|be})^{1-a}

Sample | E | B | A | N | C
1      | 1 | 0 | 0 | 1 | 0
2      | 1 | 0 | 1 | 1 | 0
3      | 1 | 1 | 0 | 0 | 1
4      | 0 | 1 | 0 | 1 | 1

ML estimates are again counts, e.g. for the Alarm CPT:

\theta_{A|be} = \frac{\sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e)}{\sum_{i=1}^{N} \mathbf{1}(B_i = b, E_i = e)}
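A counting sketch for the Alarm CPT using the four samples in the table above; the function name and layout are mine.

```python
# Counting-based ML estimate of P(A=1 | B=b, E=e) from complete data.
# Columns are (E, B, A, N, C), copied from the table above.
samples = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]

def theta_A(b, e):
    match = [(E, B, A) for (E, B, A, N, C) in samples if B == b and E == e]
    if not match:
        return None                     # no data for this parent configuration
    return sum(A for (_, _, A) in match) / len(match)

for e in (0, 1):
    for b in (0, 1):
        print(f"P(A=1 | B={b}, E={e}) =", theta_A(b, e))
```

The `None` case (a parent configuration that never occurs in the data) is one motivation for the pseudo-counts discussed on the next slide.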

Page 16: BN learning (cont'd)

• Issues:

1. Priors on parameters. What if \sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e) = 0? Should we trust it?

Maybe always add some small pseudo-count \alpha_{be} > 0, i.e. use \sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e) + \alpha_{be} in the counts?

2. How do we learn a BN graph (structure)?

Test all possible structures, then pick the one with the highest data likelihood?

3. What if we do not observe some nodes (evidence not on all nodes)?
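A sketch of issue 1 with pseudo-counts added to the Alarm CPT estimate; the value alpha = 1.0 is a hypothetical choice, not a number from the slides.

```python
# Pseudo-count smoothing of P(A=1 | B=b, E=e).
# (E, B, A) columns taken from the table on the previous slide.
samples = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 0)]

def theta_A_smoothed(b, e, alpha=1.0):
    n_match = sum(1 for (E, B, A) in samples if B == b and E == e)
    n_alarm = sum(1 for (E, B, A) in samples if B == b and E == e and A == 1)
    # alpha "imaginary" counts for A=1 and for A=0
    return (n_alarm + alpha) / (n_match + 2 * alpha)

print(theta_A_smoothed(0, 0))   # no data for (B=0, E=0): falls back to 0.5 instead of 0/0
print(theta_A_smoothed(1, 1))   # one sample with A=0: 1/3 instead of a hard 0
```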

Page 17: Learning from incomplete data

• Example:
– In the alarm network, we received data where we only know Newscast, Call, Earthquake, and Burglary, but have no idea what the Alarm state is.
– In the SPAM model, we do not know if a message is spam or not (missing label).

Sample | E | B | A   | N | C
1      | 1 | 0 | N/A | 1 | 0
2      | 1 | 0 | N/A | 1 | 0
3      | 1 | 1 | N/A | 0 | 1
4      | 0 | 1 | N/A | 1 | 1

• Solution? We can still try to find network parameters that maximize the likelihood of the incomplete data.

\theta^* = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log P(d_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{h} P(d_i, h \mid \theta)

(h is the hidden variable)

Page 18: Completing the data

• Maximizing the incomplete data likelihood is tricky.
• If we could, somehow, complete the data, we would know how to select model parameters that maximize the completed-data likelihood.
• How do we complete the missing data?

1. Randomly complete?

2. Estimate missing data from evidence, P( h | Evidence ).

Sample | E | B | A                          | N | C
1.0    | 1 | 0 | P( a=0 | E=1,B=0,N=1,C=0 ) | 1 | 0
1.1    | 1 | 0 | P( a=1 | E=1,B=0,N=1,C=0 ) | 1 | 0
2      | 1 | 0 | …                          | 1 | 0
3      | 1 | 1 | …                          | 0 | 1
4      | 0 | 1 | …                          | 1 | 1

(Sample 1 is split into two completions, one per value of A, weighted by the posterior given the evidence.)

Page 19: EM algorithm

• With completed data, Dc, maximize completed (log)likelihood by weighting contribution from each sample with P(h|d)

\theta^* = \arg\max_{\theta} L(D_c \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{h} P(h \mid d_i, \theta^{k-1}) \log P(d_i, h \mid \theta)

For the alarm network with A hidden, the hard counts are replaced by expected counts, e.g.

\theta_{A|be} = \frac{\sum_{i=1}^{N} \mathbf{1}(B_i = b, E_i = e)\, P(A_i = 1 \mid B_i = b, E_i = e, N_i = n_i, C_i = c_i)}{\sum_{i=1}^{N} \sum_{a=0}^{1} \mathbf{1}(B_i = b, E_i = e)\, P(A_i = a \mid B_i = b, E_i = e, N_i = n_i, C_i = c_i)}

• E(xpectation) M(aximization) algorithm:

1. Pick initial parameter estimates θ_0.

2. error = Inf;

3. While (error > max_error)

1. E-step: complete the data, D_c, based on θ_{k-1}.

2. M-step: compute new parameters θ_k that maximize the completed-data likelihood.

3. error = L( D | θ_k ) - L( D | θ_{k-1} )
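A sketch of one E-step/M-step pass for the alarm network with A hidden, following the expected-count update for θ_{A|be} above. The data are the four incomplete samples from Page 17; the starting CPT values and the P(Call | Alarm) parameters are made-up numbers, not from the slides.

```python
# One EM update for P(A=1 | B, E) when A is unobserved.
samples = [  # (E, B, N, C); A is missing in every row
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 1, 0, 1),
    (0, 1, 1, 1),
]

# current parameter guesses
theta_A = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}   # P(A=1 | B=b, E=e)
theta_C = {0: 0.05, 1: 0.8}                                      # P(C=1 | A=a)

def posterior_A(e, b, c):
    """E-step: P(A=1 | E=e, B=b, N=n, C=c). Newscast cancels since it depends only on E."""
    w1 = theta_A[(b, e)] * (theta_C[1] if c else 1 - theta_C[1])
    w0 = (1 - theta_A[(b, e)]) * (theta_C[0] if c else 1 - theta_C[0])
    return w1 / (w0 + w1)

def m_step():
    """M-step: expected counts in place of the hard counts used with complete data."""
    new_theta_A = {}
    for b in (0, 1):
        for e in (0, 1):
            rows = [(E, B, N, C) for (E, B, N, C) in samples if B == b and E == e]
            if not rows:
                continue                 # no data for this parent configuration
            expected = sum(posterior_A(E, B, C) for (E, B, N, C) in rows)
            new_theta_A[(b, e)] = expected / len(rows)
    return new_theta_A

print(m_step())
```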

Page 20: EM example

• Candy problem, but now we do not know which bag the candy came from (bag label missing).

• E-step: compute the posterior ("responsibility") of bag u for each candy d_i, using the current parameters θ^{k-1}, π^{k-1}:

P(u \mid d_i, \theta^{k-1}, \pi^{k-1}) \;\propto\; P(d_i \mid u, \theta_u^{k-1})\,\pi_u^{k-1}

• M-step: maximize the completed (expected) log-likelihood

L_c(\theta, \pi) = \sum_{i=1}^{N}\sum_{u=1}^{5} P(u \mid d_i, \theta^{k-1}, \pi^{k-1})\left[\mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) + \log\pi_u\right]

Setting \partial L_c/\partial\theta_u = 0 and \partial L_c/\partial\pi_u = 0 (with \sum_u \pi_u = 1) gives

\theta_u^{k} = \frac{\sum_{i=1}^{N}\mathbf{1}(d_i = C)\,P(u \mid d_i, \theta^{k-1}, \pi^{k-1})}{\sum_{i=1}^{N} P(u \mid d_i, \theta^{k-1}, \pi^{k-1})} \qquad\text{(candy (C) probability in bag u)}

\pi_u^{k} = \frac{1}{N}\sum_{i=1}^{N} P(u \mid d_i, \theta^{k-1}, \pi^{k-1}) \qquad\text{(prior probability of bag u)}
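A runnable sketch of this EM loop for the candy mixture; the observed sequence and the random initialization are invented, and the variable names theta/pi follow the reconstruction above.

```python
import random

# EM for the candy problem with missing bag labels.
data = list("CCCLCCCCLCLCCCCLLCCC")       # hypothetical observed candies
N, K = len(data), 5                       # K = five bag types

random.seed(0)
theta = [random.uniform(0.2, 0.8) for _ in range(K)]   # theta_u = P(C | bag u)
pi    = [1.0 / K] * K                                  # pi_u   = P(bag u)

for step in range(50):
    # E-step: responsibilities r[i][u] = P(u | d_i, theta, pi)
    r = []
    for d in data:
        w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(K)]
        z = sum(w)
        r.append([wu / z for wu in w])
    # M-step: weighted counting
    for u in range(K):
        resp_u = sum(r[i][u] for i in range(N))
        choc_u = sum(r[i][u] for i in range(N) if data[i] == 'C')
        theta[u] = choc_u / resp_u
        pi[u] = resp_u / N

print([round(t, 2) for t in theta])
print([round(p, 2) for p in pi])
```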

Page 21: EM learning of HMM parameters

• HMM needs EM for parameter learning (unless we know exactly the hidden states at every time instance)
– Need to learn transition and emission parameters.

• E.g.: learning of HMMs for speech modeling (a runnable sketch of this loop follows the steps below):

1. Assume a general (word/language) model.

2. E-step: Recognize (your own) speech using this model (Viterbi decoding).

3. M-step: Tweak parameters to recognize your speech a bit better (ML parameter fitting).

4. Go to 2.
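The recipe in steps 1-4 is essentially "hard EM" (Viterbi training). Below is a minimal sketch for a toy 2-state, 2-symbol HMM; the observation sequence and all starting numbers are invented, and production speech systems use the soft Baum-Welch E-step over many states rather than a single Viterbi path.

```python
import math

# "Hard EM" (Viterbi training) for a tiny discrete HMM, mirroring steps 1-4 above.
obs = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]        # observed symbol sequence
S, V = 2, 2                                       # number of states / symbols

pi = [0.6, 0.4]                                   # P(state at t=0), kept fixed here
A  = [[0.7, 0.3], [0.4, 0.6]]                     # transition P(s_t | s_{t-1})
B  = [[0.8, 0.2], [0.3, 0.7]]                     # emission   P(o_t | s_t)

def viterbi(obs):
    """E-step: most likely state sequence under the current parameters."""
    T = len(obs)
    delta = [[math.log(pi[s]) + math.log(B[s][obs[0]]) for s in range(S)]]
    back = []
    for t in range(1, T):
        row, brow = [], []
        for s in range(S):
            cands = [delta[t - 1][r] + math.log(A[r][s]) for r in range(S)]
            r_best = max(range(S), key=lambda r: cands[r])
            row.append(cands[r_best] + math.log(B[s][obs[t]]))
            brow.append(r_best)
        delta.append(row)
        back.append(brow)
    path = [max(range(S), key=lambda s: delta[-1][s])]
    for brow in reversed(back):
        path.append(brow[path[-1]])
    return list(reversed(path))

def reestimate(obs, path, pseudo=0.5):
    """M-step: re-fit transitions and emissions by counting along the decoded path."""
    for s in range(S):
        trans = [pseudo] * S                       # pseudo-counts keep probabilities > 0
        emit = [pseudo] * V
        for t in range(len(obs)):
            if path[t] == s:
                emit[obs[t]] += 1
                if t + 1 < len(obs):
                    trans[path[t + 1]] += 1
        A[s] = [c / sum(trans) for c in trans]
        B[s] = [c / sum(emit) for c in emit]

for _ in range(10):
    path = viterbi(obs)     # step 2: decode with the current model
    reestimate(obs, path)   # step 3: tweak parameters to fit the decoding

print(path, A, B)
```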