
Page 1:

Rutgers CS440, Fall 2003

Introduction to Statistical Learning

Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.

Page 2: Learning under uncertainty

• How to learn probabilistic models such as Bayesian networks, Markov models, HMMs, …?

• Examples:
– Class confusion example: how did we come up with the CPTs?

– Earthquake-burglary network structure?

– How do we learn HMMs for speech recognition?

– Kalman model (e.g., mass, friction) parameters?

– User models encoded as Bayesian networks for HCI?

Page 3: Hypotheses and Bayesian theory

• Problem:
– Two kinds of candy, lemon and chocolate
– Packed in five types of unmarked bags:
(100% C, 0% L) 10% of the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
– Task: open a bag, (unwrap a candy, observe it, …), then predict what the next one will be

• Formulation:
– H (hypothesis): h1 (100,0), h2 (75,25), h3 (50,50), h4 (25,75), or h5 (0,100)
– di (data): the i-th opened candy, L (lemon) or C (chocolate)
– Goal: predict di+1 after seeing D = { d0, d1, …, di }, i.e., compute P( di+1 | D )

Page 4: Bayesian learning

• Bayesian solution

Estimate probabilities of the hypotheses (candy bag types), then predict the data (candy type)

Hypothesis posterior ~ data likelihood × hypothesis prior:
P( hi | D ) ~ P( D | hi ) P( hi )

Prediction:
P( d | D ) = Σ_hi P( d | hi ) P( hi | D )

I.I.D. (independently, identically distributed) data points:
P( D | hi ) = P( d0 | hi ) × … × P( di | hi )

Page 5: Example

• P( hi ) = ?

– P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)

• P( di | hi ) = ?

– P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5

• P( C, C, C, C, C | h4 ) = ?

– P( C, C, C, C, C | h4 ) = 0.25^5 ≈ 0.001

• P( h5 | C, C, C, C, C ) = ?

– P( h5 | C, C, C, C, C ) ~ P( C, C, C, C, C | h5 ) P( h5 ) = 0^5 × 0.1 = 0

• P( lemon | C, C, C, C, C ) = ?
– P( lemon | h1 ) P( h1 | C,C,C,C,C ) + … + P( lemon | h5 ) P( h5 | C,C,C,C,C ) =
0×0.6244 + 0.25×0.2963 + 0.50×0.0780 + 0.75×0.0012 + 1×0 = 0.1140

• P( chocolate | C, C, C, C, … ) = ?
– P( chocolate | C, C, C, C, … ) → 1 as more chocolates are observed (see the sketch below)
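A minimal Python sketch (not part of the original slides) that reproduces the numbers above; the helper names `posterior` and `predict_lemon` are my own.

```python
# Candy-bag example: exact Bayesian posterior and prediction.
# Hypotheses h1..h5 with P(chocolate | h_i) and prior P(h_i) as on the slides.
p_choc = [1.00, 0.75, 0.50, 0.25, 0.00]   # P(C | h_i)
prior  = [0.10, 0.20, 0.40, 0.20, 0.10]   # P(h_i)

def posterior(data):
    """P(h_i | D) for a sequence like 'CCCCC'."""
    unnorm = []
    for pc, ph in zip(p_choc, prior):
        like = 1.0
        for d in data:
            like *= pc if d == 'C' else (1.0 - pc)
        unnorm.append(like * ph)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lemon(data):
    """P(lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    post = posterior(data)
    return sum((1.0 - pc) * p for pc, p in zip(p_choc, post))

print(posterior('CCCCC'))      # ~ [0.624, 0.296, 0.078, 0.001, 0.0]
print(predict_lemon('CCCCC'))  # ~ 0.114
```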

Page 6: Bayesian prediction properties

• True hypothesis eventually dominates

• Bayesian prediction is optimal (minimizes prediction error)

• Comes at a price: usually many hypotheses, intractable summation

Page 7: Approximations to Bayesian prediction

• MAP – Maximum a posteriori

P( d | D ) = P( d | h_MAP ), h_MAP = arg max_{hi} P( hi | D )

(easier to compute)

• Role of prior, P(hi): penalizes complex hypotheses

• ML – Maximum likelihood

P( d | D ) = P( d | h_ML ), h_ML = arg max_{hi} P( D | hi )
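A small sketch comparing the MAP and ML shortcuts with the full Bayesian average on the same candy data; this is not from the slides, and the helper names are mine.

```python
# MAP vs. ML hypothesis selection for the candy example.
p_choc = [1.00, 0.75, 0.50, 0.25, 0.00]   # P(C | h_i)
prior  = [0.10, 0.20, 0.40, 0.20, 0.10]   # P(h_i)

def likelihood(data, pc):
    out = 1.0
    for d in data:
        out *= pc if d == 'C' else (1.0 - pc)
    return out

def h_map(data):
    # arg max_i P(h_i | D), proportional to P(D | h_i) P(h_i)
    return max(range(5), key=lambda i: likelihood(data, p_choc[i]) * prior[i])

def h_ml(data):
    # arg max_i P(D | h_i): the prior is ignored
    return max(range(5), key=lambda i: likelihood(data, p_choc[i]))

data = 'CCCCC'
print(h_map(data), 1.0 - p_choc[h_map(data)])  # index 0 = h1 -> P(lemon | h_MAP) = 0.0
print(h_ml(data),  1.0 - p_choc[h_ml(data)])   # index 0 = h1 -> P(lemon | h_ML)  = 0.0
```

After five chocolates both shortcuts pick h1, so the approximate prediction P( lemon | D ) is 0, whereas the full Bayesian average on Page 5 gives 0.114.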

Page 8: Learning from complete data

• Learn parameters of Bayesian models from data
– e.g., learn the probabilities of C and L for a bag of candy whose proportions of C and L are unknown, by observing opened candy from that bag

• Candy problem parameters:
– θ_u – probability of C in bag u
– π_u – probability of bag u

Candy | Parameter
C     | θ_u
L     | 1 - θ_u

Bag | Parameter
1   | π_1
2   | π_2
3   | π_3
4   | π_4
5   | π_5

(Figure: candy bag u)

Page 9: ML learning from complete data

• ML approach: select model parameters to maximize likelihood of seen data

1. Need to assume a distribution model that determines how the samples (of candy) are distributed in a bag

2. Select parameters of the model that maximize the likelihood of the seen data

Model (binomial):

P(d_i \mid h_u) = \theta_u^{\,\mathbf{1}(d_i = C)}\,(1 - \theta_u)^{\,\mathbf{1}(d_i = L)}

Likelihood (i.i.d. samples):

P(D \mid h_u) = \prod_{i=1}^{N} P(d_i \mid h_u)

Log-likelihood:

L(D \mid h_u) = \log P(D \mid h_u) = \sum_{i=1}^{N} \log P(d_i \mid h_u) = \sum_{i=1}^{N} \left[ \mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) \right]

Page 10: Maximum likelihood learning (binomial distribution)

• How to find a solution to the above problem?

h^* = \arg\max_{h_u} P(D \mid h_u) = \arg\max_{h_u} L(D \mid h_u) = \arg\max_{h_u} \sum_{i=1}^{N} \log P(d_i \mid h_u)

\theta_u^* = \arg\max_{\theta_u} \sum_{i=1}^{N} \left[ \mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) \right]

Page 11: Maximum likelihood learning (cont'd)

• Take the first derivative of (log) likelihood and set it to zero

\frac{\partial L(D \mid h_u)}{\partial \theta_u} = \sum_{i=1}^{N} \left[ \frac{\mathbf{1}(d_i = C)}{\theta_u} - \frac{\mathbf{1}(d_i = L)}{1 - \theta_u} \right] = 0

\Rightarrow\quad \theta_u^* = \frac{\sum_{i=1}^{N} \mathbf{1}(d_i = C)}{\sum_{i=1}^{N} \mathbf{1}(d_i = C) + \sum_{i=1}^{N} \mathbf{1}(d_i = L)} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(d_i = C)

• Counting!

Sample | d=C | d=L
1      | 1   | 0
2      | 0   | 1
…      | …   | …
N      | 1   | 0
Total  | Σ_i 1(d_i = C) | Σ_i 1(d_i = L)
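A quick numerical check of the counting result, using an assumed candy sequence that is not from the slides:

```python
# ML estimate of theta_u = P(C in bag u) by counting, matching the closed form above.
data = ["C", "L", "C", "C", "L", "C"]    # hypothetical opened candies from one bag

n_c = sum(1 for d in data if d == "C")   # sum_i 1(d_i = C)
n_l = sum(1 for d in data if d == "L")   # sum_i 1(d_i = L)

theta_ml = n_c / (n_c + n_l)             # = n_c / N
print(theta_ml)                          # 0.666...
```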

Page 12: Naïve Bayes model

• One set of causes, multiple independent sources of evidence

(Figure: Naïve Bayes model, class node C with children E1, E2, …, EN)

• Example: C ∈ {spam, not spam}, Ei ∈ {token i present, token i absent}

P(E_1, E_2, \ldots, E_N, C) = \left[ \prod_{i=1}^{N} P(E_i \mid C) \right] P(C)

• Limiting assumption, often works well in practice

Page 13: Inference & decision in the NB model

• Inference

P(C \mid E_1, E_2, \ldots, E_N) \;\propto\; \left[ \prod_{i=1}^{N} P(E_i \mid C) \right] P(C)

\log P(C \mid E_1, E_2, \ldots, E_N) \;\sim\; \sum_{i=1}^{N} \log P(E_i \mid C) + \log P(C)

Hypothesis (class) score = evidence score + prior score

• Decision

P(C = \text{SPAM} \mid E_1, E_2, \ldots, E_N) \;\gtrless\; P(C = \text{NOT\_SPAM} \mid E_1, E_2, \ldots, E_N)

\log P(C = \text{SPAM} \mid E_1, \ldots, E_N) - \log P(C = \text{NOT\_SPAM} \mid E_1, \ldots, E_N) \;\gtrless\; 0

\sum_{i=1}^{N} \log \frac{P(E_i \mid C = \text{SPAM})}{P(E_i \mid C = \text{NOT\_SPAM})} + \log \frac{P(C = \text{SPAM})}{P(C = \text{NOT\_SPAM})} \;\gtrless\; 0 \qquad \text{(log odds ratio)}
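A small sketch of the log-odds decision rule; the token probabilities, priors, and message below are invented for illustration, not taken from the slides.

```python
import math

# Naive Bayes spam decision via the log odds ratio above.
p_token_spam = {"viagra": 0.8, "meeting": 0.1, "free": 0.6}   # P(E_i = present | SPAM)
p_token_ham  = {"viagra": 0.01, "meeting": 0.5, "free": 0.2}  # P(E_i = present | NOT_SPAM)
p_spam, p_ham = 0.4, 0.6                                      # class priors P(C)

def log_odds(present_tokens):
    score = math.log(p_spam) - math.log(p_ham)
    for tok in p_token_spam:
        ps, ph = p_token_spam[tok], p_token_ham[tok]
        if tok in present_tokens:
            score += math.log(ps) - math.log(ph)
        else:                                   # absent tokens also carry evidence
            score += math.log(1 - ps) - math.log(1 - ph)
    return score

msg = {"free", "viagra"}
print(log_odds(msg), "-> SPAM" if log_odds(msg) > 0 else "-> NOT_SPAM")
```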

Page 14: Learning in NB models

• Example:

Given a set of K email messages, each with tokens D = { d_j = (e_{1j}, …, e_{Nj}) }, e_{ij} ∈ {0,1} (token i in message j present/absent), and labels C = { c_j } (label of message j: SPAM or NOT_SPAM), find the best set of CPTs P(Ei | C) and P(C)

Assume: P(Ei | C = c) is binomial with parameter θ_{i,c}, and P(C) is binomial with parameter θ_C (2N+1 parameters in total):

P(E_i = e \mid C = c) = \theta_{i,c}^{\,e}\,(1 - \theta_{i,c})^{1-e}

P(C = c) = \theta_C^{\,c}\,(1 - \theta_C)^{1-c}

• ML learning: maximize likelihood of K messages, each one in one of the two classes

\max_{\theta_{1,0},\ldots,\theta_{N,1},\,\theta_C}\; \log P(D, C) = \sum_{j=1}^{K} \left[ \sum_{i=1}^{N} \left( e_{ij}\log\theta_{i,c_j} + (1 - e_{ij})\log(1 - \theta_{i,c_j}) \right) + c_j\log\theta_C + (1 - c_j)\log(1 - \theta_C) \right]

Setting the derivatives to zero again reduces to counting:

\theta_C = \frac{1}{K}\sum_{j=1}^{K} c_j \qquad\qquad \theta_{i,c} = \frac{1}{K_c}\sum_{j:\,c_j = c} e_{ij}, \quad K_c = \sum_{j=1}^{K} \mathbf{1}(c_j = c)
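A sketch of this counting-based ML fit on a tiny labeled set; the four messages and the vocabulary are made up for illustration.

```python
# ML estimation of Naive Bayes parameters by counting.
# Each message: (set of tokens present, label), with label 1 = SPAM, 0 = NOT_SPAM.
messages = [
    ({"free", "viagra"}, 1),
    ({"free", "meeting"}, 0),
    ({"viagra"}, 1),
    ({"meeting"}, 0),
]
vocab = ["free", "viagra", "meeting"]

K = len(messages)
theta_C = sum(c for _, c in messages) / K          # fraction of SPAM messages

theta = {}                                         # theta[(token, c)] = P(token present | class c)
for c in (0, 1):
    in_class = [tokens for tokens, label in messages if label == c]
    K_c = len(in_class)
    for tok in vocab:
        theta[(tok, c)] = sum(1 for tokens in in_class if tok in tokens) / K_c

print(theta_C)      # 0.5
print(theta)        # e.g. P(free | SPAM) = 0.5, P(viagra | SPAM) = 1.0, ...
```

Note the zero estimates this produces (e.g. P(viagra present | NOT_SPAM) = 0), which is exactly the issue the pseudo-counts on Page 16 address.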

Page 15: Learning of Bayesian network parameters

• Naïve Bayes learning can be extended to BNs! How?
• Model each CPT as a binomial/multinomial distribution. Maximize the likelihood of the data given the BN.

(Figure: the earthquake-burglary network; Earthquake and Burglary are parents of Alarm, Earthquake is a parent of Newscast, Alarm is a parent of Call)

P(E = e) = \theta_E^{\,e}\,(1 - \theta_E)^{1-e}

P(B = b) = \theta_B^{\,b}\,(1 - \theta_B)^{1-b}

P(A = a \mid B = b, E = e) = \theta_{A|be}^{\,a}\,(1 - \theta_{A|be})^{1-a}

Sample | E | B | A | N | C
1      | 1 | 0 | 0 | 1 | 0
2      | 1 | 0 | 1 | 1 | 0
3      | 1 | 1 | 0 | 0 | 1
4      | 0 | 1 | 0 | 1 | 1

ML estimates are again counts, e.g. for the Alarm CPT:

\theta_{A|be} = \frac{\sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e)}{\sum_{i=1}^{N} \mathbf{1}(B_i = b, E_i = e)}
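A counting sketch for the Alarm CPT using the four samples in the table above; the function name and layout are mine.

```python
# Counting-based ML estimate of P(A=1 | B=b, E=e) from complete data.
# Columns are (E, B, A, N, C), copied from the table above.
samples = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]

def theta_A(b, e):
    match = [(E, B, A) for (E, B, A, N, C) in samples if B == b and E == e]
    if not match:
        return None                     # no data for this parent configuration
    return sum(A for (_, _, A) in match) / len(match)

for e in (0, 1):
    for b in (0, 1):
        print(f"P(A=1 | B={b}, E={e}) =", theta_A(b, e))
```

The `None` case (a parent configuration that never occurs in the data) is one motivation for the pseudo-counts discussed on the next slide.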

Page 16: BN learning (cont'd)

• Issues:

1. Priors on parameters. What if \sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e) = 0? Should we trust it?

Maybe always add some small pseudo-count \alpha_{be} > 0, i.e. use \sum_{i=1}^{N} \mathbf{1}(A_i = 1, B_i = b, E_i = e) + \alpha_{be} in the counts?

2. How do we learn a BN graph (structure)?

Test all possible structures, then pick the one with the highest data likelihood?

3. What if we do not observe some nodes (evidence not on all nodes)?
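A sketch of issue 1 with pseudo-counts added to the Alarm CPT estimate; the value alpha = 1.0 is a hypothetical choice, not a number from the slides.

```python
# Pseudo-count smoothing of P(A=1 | B=b, E=e).
# (E, B, A) columns taken from the table on the previous slide.
samples = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 0)]

def theta_A_smoothed(b, e, alpha=1.0):
    n_match = sum(1 for (E, B, A) in samples if B == b and E == e)
    n_alarm = sum(1 for (E, B, A) in samples if B == b and E == e and A == 1)
    # alpha "imaginary" counts for A=1 and for A=0
    return (n_alarm + alpha) / (n_match + 2 * alpha)

print(theta_A_smoothed(0, 0))   # no data for (B=0, E=0): falls back to 0.5 instead of 0/0
print(theta_A_smoothed(1, 1))   # one sample with A=0: 1/3 instead of a hard 0
```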

Page 17: Learning from incomplete data

• Example:
– In the alarm network, we received data where we only know Newscast, Call, Earthquake, and Burglary, but have no idea what the Alarm state is.
– In the SPAM model, we do not know if a message is spam or not (missing label).

Sample | E | B | A   | N | C
1      | 1 | 0 | N/A | 1 | 0
2      | 1 | 0 | N/A | 1 | 0
3      | 1 | 1 | N/A | 0 | 1
4      | 0 | 1 | N/A | 1 | 1

• Solution? We can still try to find network parameters that maximize the likelihood of the incomplete data.

\theta^* = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log P(d_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{h} P(d_i, h \mid \theta)

(h is the hidden variable)

Page 18: Completing the data

• Maximizing the incomplete data likelihood is tricky.
• If we could, somehow, complete the data, we would know how to select model parameters that maximize the completed-data likelihood.
• How do we complete the missing data?

1. Randomly complete?

2. Estimate missing data from evidence, P( h | Evidence ).

Sample | E | B | A                          | N | C
1.0    | 1 | 0 | P( a=0 | E=1,B=0,N=1,C=0 ) | 1 | 0
1.1    | 1 | 0 | P( a=1 | E=1,B=0,N=1,C=0 ) | 1 | 0
2      | 1 | 0 | …                          | 1 | 0
3      | 1 | 1 | …                          | 0 | 1
4      | 0 | 1 | …                          | 1 | 1

(Sample 1 is split into two completions, one per value of A, weighted by the posterior given the evidence.)

Page 19: EM algorithm

• With completed data, Dc, maximize completed (log)likelihood by weighting contribution from each sample with P(h|d)

\theta^* = \arg\max_{\theta} L(D_c \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{h} P(h \mid d_i, \theta^{k-1}) \log P(d_i, h \mid \theta)

For the alarm network with A hidden, the hard counts are replaced by expected counts, e.g.

\theta_{A|be} = \frac{\sum_{i=1}^{N} \mathbf{1}(B_i = b, E_i = e)\, P(A_i = 1 \mid B_i = b, E_i = e, N_i = n_i, C_i = c_i)}{\sum_{i=1}^{N} \sum_{a=0}^{1} \mathbf{1}(B_i = b, E_i = e)\, P(A_i = a \mid B_i = b, E_i = e, N_i = n_i, C_i = c_i)}

• E(xpectation) M(aximization) algorithm:

1. Pick initial parameter estimates θ_0.

2. error = Inf;

3. While (error > max_error)

1. E-step: complete the data, D_c, based on θ_{k-1}.

2. M-step: compute new parameters θ_k that maximize the completed-data likelihood.

3. error = L( D | θ_k ) - L( D | θ_{k-1} )
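A sketch of one E-step/M-step pass for the alarm network with A hidden, following the expected-count update for θ_{A|be} above. The data are the four incomplete samples from Page 17; the starting CPT values and the P(Call | Alarm) parameters are made-up numbers, not from the slides.

```python
# One EM update for P(A=1 | B, E) when A is unobserved.
samples = [  # (E, B, N, C); A is missing in every row
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 1, 0, 1),
    (0, 1, 1, 1),
]

# current parameter guesses
theta_A = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}   # P(A=1 | B=b, E=e)
theta_C = {0: 0.05, 1: 0.8}                                      # P(C=1 | A=a)

def posterior_A(e, b, c):
    """E-step: P(A=1 | E=e, B=b, N=n, C=c). Newscast cancels since it depends only on E."""
    w1 = theta_A[(b, e)] * (theta_C[1] if c else 1 - theta_C[1])
    w0 = (1 - theta_A[(b, e)]) * (theta_C[0] if c else 1 - theta_C[0])
    return w1 / (w0 + w1)

def m_step():
    """M-step: expected counts in place of the hard counts used with complete data."""
    new_theta_A = {}
    for b in (0, 1):
        for e in (0, 1):
            rows = [(E, B, N, C) for (E, B, N, C) in samples if B == b and E == e]
            if not rows:
                continue                 # no data for this parent configuration
            expected = sum(posterior_A(E, B, C) for (E, B, N, C) in rows)
            new_theta_A[(b, e)] = expected / len(rows)
    return new_theta_A

print(m_step())
```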

Page 20: EM example

• Candy problem, but now we do not know which bag the candy came from (bag label missing).

• E-step: compute the posterior ("responsibility") of bag u for each candy d_i, using the current parameters θ^{k-1}, π^{k-1}:

P(u \mid d_i, \theta^{k-1}, \pi^{k-1}) \;\propto\; P(d_i \mid u, \theta_u^{k-1})\,\pi_u^{k-1}

• M-step: maximize the completed (expected) log-likelihood

L_c(\theta, \pi) = \sum_{i=1}^{N}\sum_{u=1}^{5} P(u \mid d_i, \theta^{k-1}, \pi^{k-1})\left[\mathbf{1}(d_i = C)\log\theta_u + \mathbf{1}(d_i = L)\log(1 - \theta_u) + \log\pi_u\right]

Setting \partial L_c/\partial\theta_u = 0 and \partial L_c/\partial\pi_u = 0 (with \sum_u \pi_u = 1) gives

\theta_u^{k} = \frac{\sum_{i=1}^{N}\mathbf{1}(d_i = C)\,P(u \mid d_i, \theta^{k-1}, \pi^{k-1})}{\sum_{i=1}^{N} P(u \mid d_i, \theta^{k-1}, \pi^{k-1})} \qquad\text{(candy (C) probability in bag u)}

\pi_u^{k} = \frac{1}{N}\sum_{i=1}^{N} P(u \mid d_i, \theta^{k-1}, \pi^{k-1}) \qquad\text{(prior probability of bag u)}
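A runnable sketch of this EM loop for the candy mixture; the observed sequence and the random initialization are invented, and the variable names theta/pi follow the reconstruction above.

```python
import random

# EM for the candy problem with missing bag labels.
data = list("CCCLCCCCLCLCCCCLLCCC")       # hypothetical observed candies
N, K = len(data), 5                       # K = five bag types

random.seed(0)
theta = [random.uniform(0.2, 0.8) for _ in range(K)]   # theta_u = P(C | bag u)
pi    = [1.0 / K] * K                                  # pi_u   = P(bag u)

for step in range(50):
    # E-step: responsibilities r[i][u] = P(u | d_i, theta, pi)
    r = []
    for d in data:
        w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(K)]
        z = sum(w)
        r.append([wu / z for wu in w])
    # M-step: weighted counting
    for u in range(K):
        resp_u = sum(r[i][u] for i in range(N))
        choc_u = sum(r[i][u] for i in range(N) if data[i] == 'C')
        theta[u] = choc_u / resp_u
        pi[u] = resp_u / N

print([round(t, 2) for t in theta])
print([round(p, 2) for p in pi])
```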

Page 21: EM learning of HMM parameters

• HMM needs EM for parameter learning (unless we know exactly the hidden states at every time instance)
– Need to learn transition and emission parameters.

• E.g.: learning of HMMs for speech modeling (a runnable sketch of this loop follows the steps below):

1. Assume a general (word/language) model.

2. E-step: Recognize (your own) speech using this model (Viterbi decoding).

3. M-step: Tweak parameters to recognize your speech a bit better (ML parameter fitting).

4. Go to 2.
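The recipe in steps 1-4 is essentially "hard EM" (Viterbi training). Below is a minimal sketch for a toy 2-state, 2-symbol HMM; the observation sequence and all starting numbers are invented, and production speech systems use the soft Baum-Welch E-step over many states rather than a single Viterbi path.

```python
import math

# "Hard EM" (Viterbi training) for a tiny discrete HMM, mirroring steps 1-4 above.
obs = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]        # observed symbol sequence
S, V = 2, 2                                       # number of states / symbols

pi = [0.6, 0.4]                                   # P(state at t=0), kept fixed here
A  = [[0.7, 0.3], [0.4, 0.6]]                     # transition P(s_t | s_{t-1})
B  = [[0.8, 0.2], [0.3, 0.7]]                     # emission   P(o_t | s_t)

def viterbi(obs):
    """E-step: most likely state sequence under the current parameters."""
    T = len(obs)
    delta = [[math.log(pi[s]) + math.log(B[s][obs[0]]) for s in range(S)]]
    back = []
    for t in range(1, T):
        row, brow = [], []
        for s in range(S):
            cands = [delta[t - 1][r] + math.log(A[r][s]) for r in range(S)]
            r_best = max(range(S), key=lambda r: cands[r])
            row.append(cands[r_best] + math.log(B[s][obs[t]]))
            brow.append(r_best)
        delta.append(row)
        back.append(brow)
    path = [max(range(S), key=lambda s: delta[-1][s])]
    for brow in reversed(back):
        path.append(brow[path[-1]])
    return list(reversed(path))

def reestimate(obs, path, pseudo=0.5):
    """M-step: re-fit transitions and emissions by counting along the decoded path."""
    for s in range(S):
        trans = [pseudo] * S                       # pseudo-counts keep probabilities > 0
        emit = [pseudo] * V
        for t in range(len(obs)):
            if path[t] == s:
                emit[obs[t]] += 1
                if t + 1 < len(obs):
                    trans[path[t + 1]] += 1
        A[s] = [c / sum(trans) for c in trans]
        B[s] = [c / sum(emit) for c in emit]

for _ in range(10):
    path = viterbi(obs)     # step 2: decode with the current model
    reestimate(obs, path)   # step 3: tweak parameters to fit the decoding

print(path, A, B)
```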