
  • Lecture 20: Naïve Bayes
    Lisa Yan
    August 10, 2018

  • Announcements

    Final Exam Review:
    ◦ Survey for concepts/questions (please fill out by Monday 8/13):
      https://goo.gl/forms/UPsGbff3FY5MU5pf2
    Note: Last OH on Wednesday 8/15 (none on Thursday 8/16)
    Problem Set #6 coding question: covered today

  • Summary from last time: Parameter estimation

    Given a sample of size n of IID RVs drawn from the same distribution:
    • Find a "point estimate" of the model parameters θ
      (useful for simulating new samples later on).
    • Intuitively, this should be the parameters that make the observed sample the most likely.
    • Joint likelihood of seeing the particular sample x_1, x_2, …, x_n:

        L(θ) = f(x_1, x_2, …, x_n | θ) = ∏_{i=1}^{n} f(x_i | θ)      (Likelihood function)

    The conditional densities are a function of the chosen parameter θ.

  • Maximum Likelihood Estimator

    Consider a sample of n IID RVs X_1, …, X_n, where X_i has density function f(X_i | θ).

    Likelihood function:
      L(θ) = f(X_1, X_2, …, X_n | θ) = ∏_{i=1}^{n} f(X_i | θ)

    Maximum Likelihood Estimator (MLE):
      θ̂_MLE = argmax_θ L(θ) = argmax_θ LL(θ) = argmax_θ ∑_{i=1}^{n} log f(X_i | θ)

    1. Determine the formula for the log-likelihood LL(θ) based on f(X_i | θ).
    2. Differentiate LL(θ) and set it to 0.
    3. Solve the simultaneous equations for θ̂_MLE.

  • MLE with Bernoulli

    Consider a sample of n IID RVs X_1, X_2, …, X_n.
    Assume the model X_i ~ Ber(p), with PMF P(X_i | p) = p^{x_i} (1 − p)^{1 − x_i}, where x_i = 0 or x_i = 1.
    What is p̂_MLE?

      p̂_MLE = argmax_p LL(p) = argmax_p ∑_{i=1}^{n} log P(X_i | p)

    1. Log-likelihood LL(p):
         LL(p) = ∑_{i=1}^{n} log( p^{x_i} (1 − p)^{1 − x_i} )
               = ∑_{i=1}^{n} [ x_i log p + (1 − x_i) log(1 − p) ]
               = Y log p + (n − Y) log(1 − p),   where Y = ∑_{i=1}^{n} x_i   (variable substitution)
    2. Differentiate w.r.t. p and set to 0:
         dLL(p)/dp = Y/p + (n − Y) · (−1)/(1 − p) = 0
    3. Solve:
         p̂_MLE = Y/n = (1/n) ∑_{i=1}^{n} x_i
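
    A quick numerical check of this result as a Python sketch; the seed and simulated Bernoulli sample below are made up for illustration:

```python
import random

random.seed(109)
# Hypothetical Bernoulli(p = 0.3) sample of size n = 1000.
sample = [1 if random.random() < 0.3 else 0 for _ in range(1000)]

# MLE for a Bernoulli: p_hat = (# of successes) / n = sample mean.
Y = sum(sample)      # number of successes
n = len(sample)
p_mle = Y / n
print(f"p_MLE = {p_mle:.3f}   (true p = 0.3)")
```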

  • The issue with Maximum Likelihood Estimators

      θ̂_MLE = argmax_θ L(θ)

    The parameter θ that makes the observed sample the most likely is not always the "best" θ.

    Consider a coin. Experiment: flip n = 5 times; observe 5 heads.

      p̂_MLE = (# of successes)/n = 5/5 = 1
      P(heads | p̂_MLE = 1) = 1

    • This is hard to believe, because our prior knowledge leads us to think that there is some possibility of flipping tails.
    • But MLE says we can only use the data to estimate θ.

    Solution: Maximum A Posteriori Estimators

  • Maximum A Posteriori Estimator

    Maximum A Posteriori (MAP) Estimator:
      θ̂_MAP = argmax_θ P(θ | x_1, x_2, …, x_n) = argmax_θ g(θ) ∏_{i=1}^{n} f(x_i | θ)

    If the conjugate distribution is known:
    1. Assume imaginary trials.
    2. Update the posterior with the real experiment.
    3. Report the mode of the posterior.

    Laplace smoothing: assume you have seen 1 imaginary trial per outcome.

  • Conjugate distributions

    • When using MAP estimators, we have a choice of prior.
    • But we need to know the distribution of our posterior in order to find its mode (θ̂_MAP).

    Conjugate distributions:
    • For a given model (e.g., Bernoulli/Multinomial/Poisson),
    • if we choose the conjugate distribution for the prior distribution of the parameter θ (e.g., Beta/Dirichlet/Gamma),
    • we know the posterior distribution is of the same family (e.g., Beta/Dirichlet/Gamma),
    • and therefore we can find the mode of the posterior.

    Prior: g(θ)
    Likelihood: ∏_{i=1}^{n} P(x_i | θ)
    Posterior: P(θ | x_1, x_2, …, x_n)
    θ̂_MAP = argmax_θ P(θ | x_1, x_2, …, x_n)

  • Multinomial, now with MAP + Laplace Smoothing

    Interpreting the Multinomial MAP + Laplace estimate:
    Consider a 6-sided die. Let X ~ Multinomial(p_1, p_2, p_3, p_4, p_5, p_6).
    Experiment: roll n = 12 times. Observe 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.

    What is the Laplace estimator for p_k, for k = 1, 2, …, 6?

      p̂_k = ((# times outcome k occurs) + 1) / (n + 6)

    Solution:
      p̂_1 = 4/18,  p̂_2 = 3/18,  p̂_3 = 1/18,  p̂_4 = 4/18,  p̂_5 = 2/18,  p̂_6 = 4/18

    Recall the MLE:
      p̂_1 = 3/12,  p̂_2 = 2/12,  p̂_3 = 0/12,  p̂_4 = 3/12,  p̂_5 = 1/12,  p̂_6 = 3/12

    We no longer have 0 probability of rolling a 3!
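
    The two estimates above follow directly from the observed counts; a short Python sketch of the formulas on this slide:

```python
counts = [3, 2, 0, 3, 1, 3]              # observed rolls of each face; n = 12
n, m = sum(counts), len(counts)           # m = 6 possible outcomes

p_mle     = [c / n for c in counts]               # MLE: c_k / n
p_laplace = [(c + 1) / (n + m) for c in counts]   # Laplace: (c_k + 1) / (n + m)

print(p_mle)       # face 3 has probability 0.0
print(p_laplace)   # face 3 now has 1/18 ≈ 0.056 > 0
```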

  • MAP for Poisson

    The conjugate distribution for the Poisson is Gamma(α, β).

    Prior: Gamma(α, β), i.e., we saw α total imaginary events during β prior time periods.
    Experiment: observe k new events during the next t time periods.
    Posterior: Gamma(α + k, β + t).

    MAP estimator:
      λ̂_MAP = argmax_λ g(λ) ∏_{i=1}^{n} P(x_i | λ)
      λ̂_MAP = (α + k) / (β + t)

    (If we choose this prior, for this model, we will get this posterior, for which the mode is known.)

  • Poisson with MAP

    Consider a Poisson.
    Imaginary experiment: measure 5 time periods; observe 10 events total.
    Real experiment: measure 2 time periods; observe 11 events.

    What is the MAP estimator λ̂_MAP?

    Solution:
      λ̂_MAP = (α + k) / (β + t) = (10 + 11) / (5 + 2) = 21/7 = 3
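
    For completeness, the arithmetic above as a tiny Python check (the prior and observation numbers come straight from the slide):

```python
# Gamma(alpha, beta) prior: alpha = 10 imaginary events over beta = 5 periods.
alpha, beta = 10, 5
# Real experiment: k = 11 events over t = 2 periods.
k, t = 11, 2

# MAP estimate used in this lecture: (alpha + k) / (beta + t)
lam_map = (alpha + k) / (beta + t)
print(lam_map)   # 3.0
```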

  • Summary of MLE/MAP

    Maximum Likelihood Estimator (MLE), θ̂_MLE, vs. Maximum A Posteriori (MAP) with Laplace smoothing, θ̂_MAP:

    Bernoulli:    p̂_MLE = (# of successes)/n
                  p̂_MAP = ((# of successes) + 1)/(n + 2)
    Binomial:     p̂_MLE = (# of successes)/n
                  p̂_MAP = ((# of successes) + 1)/(n + 2)
    Multinomial:  p̂_k,MLE = (# times outcome k occurs)/n
                  p̂_k,MAP = ((# times outcome k occurs) + 1)/(n + m)

    Caveat for MLE/MAP: we need independence of trials!

  • Summary of MLE/MAP

    Maximum Likelihood Estimator (MLE), θ̂_MLE, vs. Maximum A Posteriori (MAP), θ̂_MAP:

    Poisson:  λ̂_MLE = (# of events)/(# of intervals)
              λ̂_MAP = ((# of events) + (imag. events)) / ((# of intervals) + (imag. intervals))
    Normal:   μ̂_MLE = (1/n) ∑_{i=1}^{n} x_i,   σ̂²_MLE = (1/n) ∑_{i=1}^{n} (x_i − μ̂)²    (MAP not discussed)
    Uniform:  α̂_MLE = min(x_1, x_2, …, x_n),   β̂_MLE = max(x_1, x_2, …, x_n)            (MAP not discussed)

    Additional MLE caveat: it suffers from small samples.

  • Goals for today

    Machine Learning
    ◦ Introduction
    ◦ Linear Regression
    ◦ Brute-force Classification
    ◦ Naïve Bayes
    ◦ Evaluation tools: Precision, Recall

    "Started from the bottom, now we're here" – Drake, 2013
    (Course-arc figure: from Counting at the start of CS109 to Machine Learning, "here," at the end.)

  • Sampling use cases

    (Diagram relating a model of X to samples:)
    • Parameter estimation: from a sample of the population, estimate a model of X.
    • Bootstrapping: from a sample, get estimates of E[X] or Var(X), confidence intervals, p-values.
    • Machine learning: from a sample with features and labels, predict labels on new data.

  • What is Machine Learning?

    Many different forms, but we focus on the problem of prediction.
    Sample data is now:
    • Features X = <X_1, X_2, …, X_m>
    • Label Y
    Goal: learn a function Ŷ = g(X) to predict Y.
    1. Training: observe the relationship from pre-existing data.
    2. Prediction: feed a new X into the model (with parameters) g(X) to get Ŷ.

    Examples:
    • Known stock patterns: X: economic indicators for a stock; Y: the stock's price tomorrow.
      On a new, unseen stock: X: economic indicators for this stock; Ŷ: the stock's price tomorrow?   (Y continuous)
    • Known medical records: X: clinical/demographic data; Y: has cancer by age 40.
      On a new person: X: clinical/demographic data; Ŷ: cancer by age 40?   (Y discrete)
    • Past purchases on a credit card: X: purchase details; Y: fraud or not.
      A new purchase: X: purchase details; Ŷ: fraud?   (Y discrete)

  • Spam, Spam… Go Away!

    "And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam."

    https://www.google.com/mail/help/intl/en-GB/fightspam/spamexplained.html

  • What is Machine Learning? (continued)

    Sample data: features X = <X_1, X_2, …, X_m> and label Y.
    Goal: learn a function Ŷ = g(X) to predict Y.
    1. Training: observe the relationship from pre-existing data.
    2. Prediction:
       • Y discrete: prediction is called "classification"
       • Y continuous: prediction is called "regression"

    Training data (sample size N; each X is an m-dimensional IID RV of features, Y a 1-D dependent label):
      X^(1) = <X_1 = 1, X_2 = 3, …, X_m = 3.5>,   Y^(1) = 1
      X^(2) = <X_1 = 0, X_2 = 1, …, X_m = −1.9>,  Y^(2) = 0
      …
      X^(N) = <X_1 = 1, X_2 = 20, …, X_m = 1.2>,  Y^(N) = 0

  • How do we find g(X)?

    Mathematically:
    1. Determine a parametric form for Ŷ = g(X) with parameters θ.
       Train on data: (X^(1), Y^(1)), (X^(2), Y^(2)), …, (X^(N), Y^(N)).
    2. Estimate the best parameters θ according to an objective function:
       • Y continuous: minimize the mean squared error (MSE) "loss", E[(Y − g(X))²]
       • Y discrete: maximize the likelihood of Y given the features X:  g(X) = argmax_y P̂(Y = y | X)

    Prediction on test data: given a new input X, use g(X) to predict the "class" Ŷ.

  • Linear Regression (as a warmup)

    Observe: a single variable X.  Output: a real value Y (continuous output).

    Model: assume linear: Ŷ = g(X) = aX + b, with parameters θ = (a, b).
    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
    (superscripts denote the k-th training instance).

    Goal: Ŷ = aX + b, where (a, b) = argmin_{a,b} E[(Y − g(X))²]

    1. E[(Y − g(X))²] = E[(Y − (aX + b))²] = E[(Y − aX − b)²]
    2. Compute the derivatives w.r.t. a and b:
         ∂/∂a E[(Y − aX − b)²] = E[−2X(Y − aX − b)] = −2E[XY] + 2aE[X²] + 2bE[X]
         ∂/∂b E[(Y − aX − b)²] = E[−2(Y − aX − b)] = −2E[Y] + 2aE[X] + 2b

  • Linear Regression (continued)

    Goal: Ŷ = aX + b, where (a, b) = argmin_{a,b} E[(Y − g(X))²]

    3. Set the derivatives to 0 and solve the simultaneous equations:
         a = (E[XY] − E[X]E[Y]) / (E[X²] − E[X]²) = Cov(X, Y) / Var(X) = ρ(X, Y) σ_Y / σ_X
         b = E[Y] − a E[X] = μ_Y − ρ(X, Y) (σ_Y / σ_X) μ_X
    4. Substitution:
         g(X) = ρ(X, Y) (σ_Y / σ_X) (X − μ_X) + μ_Y
    5. Estimate the parameters based on the observed training data:
         Ŷ = g(X = x) = ρ̂(X, Y) (σ̂_Y / σ̂_X) (x − μ̂_X) + μ̂_Y

  • Classification

    Observe: discrete observations X = <X_1, X_2, …, X_m>.  Output: a discrete "class" Y.

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X)

    Note (via Bayes' theorem): Ŷ = argmax_y P̂(X, Y = y) = argmax_y P̂(X | Y = y) P̂(Y = y)

    Example: let the X_i and Y be indicator variables of liking a franchise. For X = [1, 1, 0]:
      P̂(X, Y = 1) ≈ 0.21
      P̂(X, Y = 0) ≈ 0.14
      argmax_y → the classifier predicts Ŷ = 1

  • Classification (continued)

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y)
    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).

    What is our best estimate of P̂(X, Y)?
    • Recall: given a sample of size n, p̂_MLE maximizes the likelihood of seeing that particular sample, where each X_i in the sample is IID.
    • In our training data: each (x^(i), y^(i)) is an RV in our sample, and the (x^(i), y^(i))'s are IID.
    • P(X, Y) can be modeled as a multinomial distribution!
      # of outcomes: the total # of probabilities in the P(X, Y) space.

    What is the p̂_MLE that maximizes our likelihood of seeing the training data?
    What is the p̂_MAP that maximizes p given the training data and an assumed prior?

  • Classification (continued)

    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).
    • What is the p̂_MLE that maximizes our likelihood of seeing the training data?
        p̂_MLE = argmax_p L(p) = argmax_p ∏_{i=1}^{N} P(x^(i), y^(i) | p)        Model: (X, Y) is multinomial.
    • What is the p̂_MAP that maximizes p given the training data and a Laplace prior?
        p̂_MAP = argmax_p g(p) ∏_{i=1}^{N} P(x^(i), y^(i) | p)                   Model: (X, Y) is multinomial.
        Prior: Dirichlet with 1 imaginary trial for each possible outcome (x, y).

    Train:
    1. MLE: count the # of times each pair (x, y) appears in the training data.
    2. MAP using the Laplace prior: add 1 to all the MLE counts.
    3. Normalize to get a true distribution (sums to 1).

    Test/Predict:
    • Given a new data point X, predict Ŷ = argmax_y P̂(x, y).

  • Training for cats

    Observe: X in {1, 2, 3, 4}
    • X denotes the location of the photo: bedroom (1), bathroom (2), patio (3), fridge (4)
    Output: Y from {0, 1}
    • Y denotes whether there is a cat in the photo (1) or not (0)

    1. What are the MLE estimates of p̂_{X,Y}(x, y)?
    2. What are the Laplace estimates of p̂_{X,Y}(x, y)?

    Train data: 50 points (counts):
               X=1   X=2   X=3   X=4
      no cat     3     7    10    20
      cat        5     3     2     0

    p̂_MLE = (count in cell) / (total # data points)
    p̂_Laplace = (count in cell + 1) / (total # data points + total # cells)

    MLE estimate:
               X=1   X=2   X=3   X=4   | p_Y(y)
      no cat  0.06  0.14  0.20  0.40   | 0.80
      cat     0.10  0.06  0.04  0.00   | 0.20
      p_X(x)  0.16  0.20  0.24  0.40   | 1.00

    Laplace (MAP) estimate:
                X=1    X=2    X=3    X=4   | p_Y(y)
      no cat  0.069  0.138  0.190  0.362   | 0.759
      cat     0.103  0.069  0.052  0.017   | 0.241
      p_X(x)  0.172  0.207  0.242  0.379   | 1.00

    Ŷ = argmax_y P̂(x, y)
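
    The two probability tables above follow mechanically from the count table; a Python sketch (the `counts` array mirrors the slide's 50 training points):

```python
import numpy as np

# Rows: Y = 0 (no cat), Y = 1 (cat).  Columns: X = 1, 2, 3, 4.
counts = np.array([[3, 7, 10, 20],
                   [5, 3,  2,  0]], dtype=float)
n = counts.sum()                        # 50 data points
cells = counts.size                     # 8 cells

p_mle     = counts / n                  # count / n
p_laplace = (counts + 1) / (n + cells)  # (count + 1) / (n + # cells)

print(p_mle)       # the cat & fridge cell is exactly 0
print(p_laplace)   # every cell is now positive
```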

  • Testing for cats

    Observe: X in {1, 2, 3, 4}, the location of the photo: bedroom (1), bathroom (2), patio (3), fridge (4).
    Output: Y from {0, 1}: cat in the photo (1) or not (0).

    You are now shown a new photo of a fridge, so X = 4.
    1. What is your prediction Ŷ for whether there is a cat (or not) in the photo?
    2. Given X = 4, what is the probability of having a cat in the photo?

    Using Ŷ = argmax_y P̂(x, y) with the tables from the previous slide:

    MLE estimate:
    1. Predict: Ŷ = 0 (no cat)
    2. Absolutely, positively no chance of a cat (P̂(X = 4, Y = 1) = 0)

    Laplace (MAP) estimate:
    1. Predict: Ŷ = 0 (no cat)
    2. A small chance → "never say never"

  • The issue with multiple observables

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Predict: a discrete output Y with 2 outcomes.

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y)

    How large is your lookup table P̂(X, Y) if:
    1. m = 3?        2^3 · 2 = 16 entries
    2. m = 10?       2^10 · 2 = 2,048 entries
    3. m = 10,000?   2^10,000 · 2 = 2^10,001 entries

    • The size of the PMF table is exponential in m (e.g., O(2^m)).
    • We would need a ridiculous amount of data for good probability estimates.
    • We are likely to have many 0's in the table (bad times).
    • We need to consider a simpler model.

  • Break

    Attendance: tinyurl.com/cs109summer2018

  • Naïve Bayes Classifier

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Predict: a discrete output Y with 2 outcomes.

    Objective: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y) = argmax_y P̂(X | Y = y) P̂(Y = y)

    Naïve Bayes assumption:
    • Assume X_1, X_2, …, X_m are conditionally independent given Y.
    • You don't really believe this…
    • But it is an approximation we make to be able to make predictions.

    Therefore:
      Ŷ = argmax_y P̂(X | Y = y) P̂(Y = y) = argmax_y ∏_{i=1}^{m} P̂(X_i | Y = y) · P̂(Y = y)

    • Computation of the PMF tables is now linear in m: O(m).
    • We don't need as much data to get good probability estimates.

  • Naïve Bayes Classification

    Objective: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y) = argmax_y ∏_{i=1}^{m} P̂(X_i | Y = y) · P̂(Y = y)

    Train (see the sketch after this slide):
    1. Compute P̂(X_i | Y = y) for i = 1, …, m:
       a. Bernoulli MLE (# times X_i = 0, Y = y vs. X_i = 1, Y = y)
       b. Laplace estimate: add 1 to the MLE counts
       c. Normalize to get a true distribution (sums to 1)
    2. Compute P̂(Y):
       a. Bernoulli MLE (# times Y = 0 vs. Y = 1)
       b. Normalize by the # of samples to get a true distribution (sums to 1)
       (Note: no Laplace smoothing for the class probabilities.)

    Classify a new X = <X_1 = x_1, X_2 = x_2, …, X_m = x_m>:
    1. Compute P̂(X, Y = y) = ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) · P̂(Y = y) for y = 0 and y = 1.
    2. Return the label ("class") that led to the higher probability.
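
    A compact sketch of this train/classify recipe for binary features. The function and variable names (train_nb, predict_nb) and the tiny training set are mine, not from the lecture; the estimators follow the slide (Laplace on P(X_i | Y), plain MLE on P(Y)):

```python
from collections import Counter

def train_nb(X, y):
    """X: list of m-length binary feature tuples; y: list of 0/1 labels."""
    n, m = len(X), len(X[0])
    class_count = Counter(y)
    # MLE for the class prior P(Y = c); no smoothing, as on the slide.
    p_y = {c: class_count[c] / n for c in (0, 1)}
    # Laplace estimate of P(X_i = 1 | Y = c): (count + 1) / (#(Y = c) + 2).
    p_xi = {c: [(sum(x[i] for x, lab in zip(X, y) if lab == c) + 1) / (class_count[c] + 2)
                for i in range(m)]
            for c in (0, 1)}
    return p_y, p_xi

def predict_nb(p_y, p_xi, x):
    """Return argmax_c P(Y = c) * prod_i P(X_i = x_i | Y = c)."""
    best_c, best_prob = None, -1.0
    for c in (0, 1):
        prob = p_y[c]
        for i, xi in enumerate(x):
            prob *= p_xi[c][i] if xi == 1 else 1.0 - p_xi[c][i]
        if prob > best_prob:
            best_c, best_prob = c, prob
    return best_c

# Tiny made-up training set: m = 2 binary features, n = 6 points.
X = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]
p_y, p_xi = train_nb(X, y)
print(predict_nb(p_y, p_xi, (1, 0)))   # predicted class for a new point
```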

  • Naïve Bayes + MLE for movies

    Observe: indicator variables X_1, X_2
    • X_1: "likes Star Wars"
    • X_2: "likes Harry Potter"
    Output: Y, an indicator variable
    • Y: "likes Pokemon"
    (X_1 = 1: likes Star Wars;  X_2 = 1: likes Harry Potter;  Y = 1: likes Pokemon)

  • Naïve Bayes + MLE for movies (training)

    Train:
    1. What are the MLE estimates of p̂_{X_i|Y}(x_i | y)?
    2. What are the MLE estimates of p̂_Y(y)?
    Classify/Predict/Test:
    3. A new person likes Star Wars (X_1 = 1) but not Harry Potter (X_2 = 0). Will they like Pokemon?

    Train data: n = 30 points (counts):
      X_1 vs Y:   Y=0: X_1=0 → 3,  X_1=1 → 10      Y=1: X_1=0 → 4,  X_1=1 → 13
      X_2 vs Y:   Y=0: X_2=0 → 5,  X_2=1 → 8       Y=1: X_2=0 → 7,  X_2=1 → 10
      Y:          Y=0 → 13,  Y=1 → 17

    Estimators:
      p̂_{X_1|Y}(x_1 | y) = #(X_1 = x_1, Y = y) / #(Y = y)
      p̂_{X_2|Y}(x_2 | y) = #(X_2 = x_2, Y = y) / #(Y = y)
      p̂_Y(y) = #(Y = y) / n

    MLE estimates:
      p̂_{X_1|Y}:  Y=0: [0.23, 0.77]   Y=1: [0.24, 0.76]      (columns: x_1 = 0, 1)
      p̂_{X_2|Y}:  Y=0: [0.38, 0.62]   Y=1: [0.41, 0.59]      (columns: x_2 = 0, 1)
      p̂_Y:        [0.43, 0.57]                                (y = 0, 1)

  • Naïve Bayes + MLE for movies (classification)

    A new person likes Star Wars (X_1 = 1) but not Harry Potter (X_2 = 0). Will they like Pokemon?

    Using the MLE estimates from the previous slide:

      P̂(X, Y = 0) = P̂(X_1 = 1 | Y = 0) P̂(X_2 = 0 | Y = 0) P̂(Y = 0)
                   = p̂_{X_1|Y}(1|0) · p̂_{X_2|Y}(0|0) · p̂_Y(0)
                   = 0.77 · 0.38 · 0.43 ≈ 0.13     (smaller)

      P̂(X, Y = 1) = P̂(X_1 = 1 | Y = 1) P̂(X_2 = 0 | Y = 1) P̂(Y = 1)
                   = p̂_{X_1|Y}(1|1) · p̂_{X_2|Y}(0|1) · p̂_Y(1)
                   = 0.76 · 0.41 · 0.57 ≈ 0.18     (larger)

    Ŷ = 1: predict "likes Pokemon"
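
    The same computation in a few lines of Python, using the MLE tables above (the numbers are copied from the slide):

```python
# MLE estimates from the training data (keys: y = 0, 1; inner keys: feature value).
p_x1_given_y = {0: {0: 0.23, 1: 0.77}, 1: {0: 0.24, 1: 0.76}}
p_x2_given_y = {0: {0: 0.38, 1: 0.62}, 1: {0: 0.41, 1: 0.59}}
p_y          = {0: 0.43, 1: 0.57}

# New person: likes Star Wars (x1 = 1), not Harry Potter (x2 = 0).
x1, x2 = 1, 0
scores = {y: p_x1_given_y[y][x1] * p_x2_given_y[y][x2] * p_y[y] for y in (0, 1)}
print(scores)                        # {0: ~0.13, 1: ~0.18}
print(max(scores, key=scores.get))   # 1: predict "likes Pokemon"
```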

  • Naïve Bayes + Laplace for movies

    Train:
    1. What are the Laplace estimates of p̂_{X_i|Y}(x_i | y)?
    2. What are the MLE estimates of p̂_Y(y)?
       (Note: Laplace smoothing is not used for the prior p̂_Y.)

    Train data: n = 30 points (counts), as before:
      X_1 vs Y:   Y=0: X_1=0 → 3,  X_1=1 → 10      Y=1: X_1=0 → 4,  X_1=1 → 13
      X_2 vs Y:   Y=0: X_2=0 → 5,  X_2=1 → 8       Y=1: X_2=0 → 7,  X_2=1 → 10
      Y:          Y=0 → 13,  Y=1 → 17

    Estimators:
      p̂_{X_1|Y}(x_1 | y) = (#(X_1 = x_1, Y = y) + 1) / (#(Y = y) + 2)
      p̂_{X_2|Y}(x_2 | y) = (#(X_2 = x_2, Y = y) + 1) / (#(Y = y) + 2)
      p̂_Y(y) = #(Y = y) / n

    Laplace estimates:
      p̂_{X_1|Y}:  Y=0: [0.27, 0.73]   Y=1: [0.26, 0.74]      (columns: x_1 = 0, 1)
      p̂_{X_2|Y}:  Y=0: [0.40, 0.60]   Y=1: [0.42, 0.58]      (columns: x_2 = 0, 1)
      p̂_Y:        [0.43, 0.57]                                (y = 0, 1)

  • Summary so far

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Output: a discrete output Y with 2 outcomes.

    Classification:
      Ŷ = g(X) = argmax_y P̂(Y = y | X)              (by definition)
               = argmax_y P̂(X, Y = y)               (equivalent)
               = argmax_y P̂(X | Y = y) P̂(Y = y)     (equivalent)

    Naïve Bayes assumption:
    • Assume the X_i's are conditionally independent given Y.
    • This reduces the probabilities we need to compute from O(2^m) to O(m).
    • Training: use the MLE or Laplace estimate for P̂(X_i | Y), and the MLE estimate for P̂(Y).
    • Testing: compute
        Ŷ = argmax_y ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) · P̂(Y = y)

  • Log-probabilities

    What if m is large and each P̂(x_i | y) is small?

      Ŷ = argmax_y P̂(Y = y | X) = argmax_y P̂(Y = y) ∏_{i=1}^{m} P̂(X_i = x_i | Y = y)      Underflow!

    Solution: take logs.

      Ŷ = argmax_y P̂(Y = y | X) = argmax_y log P̂(Y = y | X)
        = argmax_y log [ P̂(Y = y) ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) ]
        = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

    This works well for high-dimensional data.

  • Bags of words as a multinomial

    Example document:
    "Pay for Viagra with a credit card. Viagra is great. So are credit-cards. Risk-free Viagra. Click for free."

    Word counts (n = 18):
      Viagra = 2, Free = 2, Risk = 1, Credit-card = 2, …, For = 2

    Probability of (this document | spam writer) is multinomial:

      P(doc | spam) = n! / (2! 2! … 2!) · p²_viagra · p²_free · … · p²_for

    Note: P(viagra | spam writer) >> P(viagra | writer = you).

  • Bag of Words with Naïve Bayes

    Goal: predict whether an email is spam or not.
    Known: a lexicon of m words (e.g., for the English language, m ≈ 100,000).

    Observe: m indicator variables X = <X_1, X_2, …, X_m>, where X_i denotes whether word i appeared in the email or not.
    Output: Y, an indicator variable denoting whether the email is spam or not.

    Preprocess: N previous emails with the following info per email:
    • X = <X_1, X_2, …, X_m>, noting for each word whether it appeared in this email
    • Y, a label marking this email as spam or not

      Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

    (The features are now indicators, no longer counts.)

  • Spam Classifier

    Training:
    1. Go through all training emails and update all counts:
         #(X_i = 1, Y = spam),  #(X_i = 1, Y = not spam)
         #(X_i = 0, Y = spam),  #(X_i = 0, Y = not spam)
       Sanity check: #(X_i = 1, Y = spam) + #(X_i = 0, Y = spam) = #(Y = spam), and so on.
    2. Since many words (in the lexicon of 100,000 words) are likely to not appear at all in the training emails, use the Laplace estimate:
         P̂(X_i = x_i | Y = y) = (#(X_i = x_i, Y = y) + 1) / (#(Y = y) + 2)

    Classifying:
    • Classify a new email:
        Ŷ = argmax_y P̂(Y = y | X)
    • Employ the Naïve Bayes assumption:
        Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

  • Checking performance

    After training, we can test with another set of data.
    Test data: also has features X = <X_1, X_2, …, X_m> and known Y values.

    We can then see how often we were right/wrong in our predictions for Y. Metrics:
    • Precision: (# correctly predicted as class Y) / (# predicted as class Y)
    • Recall: (# correctly predicted as class Y) / (# real class Y messages)
    • Accuracy: (# correct across all classes) / (total # test emails)

    Spam data:
    • Email dataset: 1789 emails (1578 spam, 211 non-spam)
    • Randomly choose 1538 emails for training
    • Remaining 251 emails for testing the learned classifier

    Results (words-only features):
                   Spam                     Non-spam
                   Precision   Recall       Precision   Recall
      Words only   97.1%       94.3%        87.7%       93.4%
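
    A minimal sketch of these three metrics; the toy predicted/true labels below are invented, not the email dataset from the slide:

```python
def precision_recall(y_true, y_pred, cls):
    pred_cls   = sum(1 for p in y_pred if p == cls)               # predicted as cls
    actual_cls = sum(1 for t in y_true if t == cls)               # really cls
    correct    = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
    precision  = correct / pred_cls if pred_cls else 0.0
    recall     = correct / actual_cls if actual_cls else 0.0
    return precision, recall

# Hypothetical test labels: 1 = spam, 0 = non-spam.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print("spam:    ", precision_recall(y_true, y_pred, 1))   # (0.8, 0.8)
print("non-spam:", precision_recall(y_true, y_pred, 0))   # (0.667, 0.667)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)                               # 0.75
```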

  • Congratulations!

    You've just learned Machine Learning in a nutshell.

    1. Have some training data: features X = <X_1, X_2, …, X_m> and labels Y.
    2. Assume some relationship between the features and the labels.
    3. Get estimates of the relationship from the training data: p̂_{X_i|Y}(x_i | y) and p̂_Y(y).
    4. Use the estimates to predict on test data using the assumptions:
         Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]