
  • Lecture 20: Naïve Bayes
    Lisa Yan
    August 10, 2018

  • Announcements

    Final Exam Review:
    ◦ Survey for concepts/questions (please fill out by Monday 8/13):
      https://goo.gl/forms/UPsGbff3FY5MU5pf2
    Note: Last OH on Wednesday 8/15 (none on Thursday 8/16)
    Problem Set #6 coding question: covered today

  • Summary from last time: Parameter estimation

    Given a sample of size n of IID RVs drawn from the same distribution:
    • Find a "point estimate" of the model parameters θ
      (useful for simulating new samples later on).
    • Intuitively, this should be the parameters that make the observed sample the most likely.
    • Joint likelihood of seeing the particular sample x_1, x_2, …, x_n:

        L(θ) = f(x_1, x_2, …, x_n | θ) = ∏_{i=1}^{n} f(x_i | θ)      (Likelihood function)

    The conditional densities are a function of the chosen parameter θ.

  • Maximum Likelihood Estimator

    Consider a sample of n IID RVs X_1, …, X_n, where X_i has density function f(X_i | θ).

    Likelihood function:
      L(θ) = f(X_1, X_2, …, X_n | θ) = ∏_{i=1}^{n} f(X_i | θ)

    Maximum Likelihood Estimator (MLE):
      θ̂_MLE = argmax_θ L(θ) = argmax_θ LL(θ) = argmax_θ ∑_{i=1}^{n} log f(X_i | θ)

    1. Determine the formula for the log-likelihood LL(θ) based on f(X_i | θ).
    2. Differentiate LL(θ) and set it to 0.
    3. Solve the simultaneous equations for θ̂_MLE.

  • MLE with Bernoulli

    Consider a sample of n IID RVs X_1, X_2, …, X_n.
    Assume the model X_i ~ Ber(p), with PMF P(X_i | p) = p^{x_i} (1 − p)^{1 − x_i}, where x_i = 0 or x_i = 1.
    What is p̂_MLE?

      p̂_MLE = argmax_p LL(p) = argmax_p ∑_{i=1}^{n} log P(X_i | p)

    1. Log-likelihood LL(p):
         LL(p) = ∑_{i=1}^{n} log( p^{x_i} (1 − p)^{1 − x_i} )
               = ∑_{i=1}^{n} [ x_i log p + (1 − x_i) log(1 − p) ]
               = Y log p + (n − Y) log(1 − p),   where Y = ∑_{i=1}^{n} x_i   (variable substitution)
    2. Differentiate w.r.t. p and set to 0:
         dLL(p)/dp = Y/p + (n − Y) · (−1)/(1 − p) = 0
    3. Solve:
         p̂_MLE = Y/n = (1/n) ∑_{i=1}^{n} x_i
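
    A quick numerical check of this result as a Python sketch; the seed and simulated Bernoulli sample below are made up for illustration:

```python
import random

random.seed(109)
# Hypothetical Bernoulli(p = 0.3) sample of size n = 1000.
sample = [1 if random.random() < 0.3 else 0 for _ in range(1000)]

# MLE for a Bernoulli: p_hat = (# of successes) / n = sample mean.
Y = sum(sample)      # number of successes
n = len(sample)
p_mle = Y / n
print(f"p_MLE = {p_mle:.3f}   (true p = 0.3)")
```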

  • The issue with Maximum Likelihood Estimators

      θ̂_MLE = argmax_θ L(θ)

    The parameter θ that makes the observed sample the most likely is not always the "best" θ.

    Consider a coin. Experiment: flip n = 5 times; observe 5 heads.

      p̂_MLE = (# of successes)/n = 5/5 = 1
      P(heads | p̂_MLE = 1) = 1

    • This is hard to believe, because our prior knowledge leads us to think that there is some possibility of flipping tails.
    • But MLE says we can only use the data to estimate θ.

    Solution: Maximum A Posteriori Estimators

  • Maximum A Posteriori Estimator

    Maximum A Posteriori (MAP) Estimator:
      θ̂_MAP = argmax_θ P(θ | x_1, x_2, …, x_n) = argmax_θ g(θ) ∏_{i=1}^{n} f(x_i | θ)

    If the conjugate distribution is known:
    1. Assume imaginary trials.
    2. Update the posterior with the real experiment.
    3. Report the mode of the posterior.

    Laplace smoothing: assume you have seen 1 imaginary trial per outcome.

  • Conjugate distributions

    • When using MAP estimators, we have a choice of prior.
    • But we need to know the distribution of our posterior in order to find its mode (θ̂_MAP).

    Conjugate distributions:
    • For a given model (e.g., Bernoulli/Multinomial/Poisson),
    • if we choose the conjugate distribution for the prior distribution of the parameter θ (e.g., Beta/Dirichlet/Gamma),
    • we know the posterior distribution is of the same family (e.g., Beta/Dirichlet/Gamma),
    • and therefore we can find the mode of the posterior.

    Prior: g(θ)
    Likelihood: ∏_{i=1}^{n} P(x_i | θ)
    Posterior: P(θ | x_1, x_2, …, x_n)
    θ̂_MAP = argmax_θ P(θ | x_1, x_2, …, x_n)

  • Multinomial, now with MAP + Laplace Smoothing

    Interpreting the Multinomial MAP + Laplace estimate:
    Consider a 6-sided die. Let X ~ Multinomial(p_1, p_2, p_3, p_4, p_5, p_6).
    Experiment: roll n = 12 times. Observe 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.

    What is the Laplace estimator for p_k, for k = 1, 2, …, 6?

      p̂_k = ((# times outcome k occurs) + 1) / (n + 6)

    Solution:
      p̂_1 = 4/18,  p̂_2 = 3/18,  p̂_3 = 1/18,  p̂_4 = 4/18,  p̂_5 = 2/18,  p̂_6 = 4/18

    Recall the MLE:
      p̂_1 = 3/12,  p̂_2 = 2/12,  p̂_3 = 0/12,  p̂_4 = 3/12,  p̂_5 = 1/12,  p̂_6 = 3/12

    We no longer have 0 probability of rolling a 3!
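
    The two estimates above follow directly from the observed counts; a short Python sketch of the formulas on this slide:

```python
counts = [3, 2, 0, 3, 1, 3]              # observed rolls of each face; n = 12
n, m = sum(counts), len(counts)           # m = 6 possible outcomes

p_mle     = [c / n for c in counts]               # MLE: c_k / n
p_laplace = [(c + 1) / (n + m) for c in counts]   # Laplace: (c_k + 1) / (n + m)

print(p_mle)       # face 3 has probability 0.0
print(p_laplace)   # face 3 now has 1/18 ≈ 0.056 > 0
```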

  • MAP for Poisson

    The conjugate distribution for the Poisson is Gamma(α, β).

    Prior: Gamma(α, β), i.e., we saw α total imaginary events during β prior time periods.
    Experiment: observe k new events during the next t time periods.
    Posterior: Gamma(α + k, β + t).

    MAP estimator:
      λ̂_MAP = argmax_λ g(λ) ∏_{i=1}^{n} P(x_i | λ)
      λ̂_MAP = (α + k) / (β + t)

    (If we choose this prior, for this model, we will get this posterior, for which the mode is known.)

  • Poisson with MAP

    Consider a Poisson.
    Imaginary experiment: measure 5 time periods; observe 10 events total.
    Real experiment: measure 2 time periods; observe 11 events.

    What is the MAP estimator λ̂_MAP?

    Solution:
      λ̂_MAP = (α + k) / (β + t) = (10 + 11) / (5 + 2) = 21/7 = 3
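
    For completeness, the arithmetic above as a tiny Python check (the prior and observation numbers come straight from the slide):

```python
# Gamma(alpha, beta) prior: alpha = 10 imaginary events over beta = 5 periods.
alpha, beta = 10, 5
# Real experiment: k = 11 events over t = 2 periods.
k, t = 11, 2

# MAP estimate used in this lecture: (alpha + k) / (beta + t)
lam_map = (alpha + k) / (beta + t)
print(lam_map)   # 3.0
```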

  • Summary of MLE/MAP

    Maximum Likelihood Estimator (MLE), θ̂_MLE, vs. Maximum A Posteriori (MAP) with Laplace smoothing, θ̂_MAP:

    Bernoulli:    p̂_MLE = (# of successes)/n
                  p̂_MAP = ((# of successes) + 1)/(n + 2)
    Binomial:     p̂_MLE = (# of successes)/n
                  p̂_MAP = ((# of successes) + 1)/(n + 2)
    Multinomial:  p̂_k,MLE = (# times outcome k occurs)/n
                  p̂_k,MAP = ((# times outcome k occurs) + 1)/(n + m)

    Caveat for MLE/MAP: we need independence of trials!

  • Summary of MLE/MAP

    Maximum Likelihood Estimator (MLE), θ̂_MLE, vs. Maximum A Posteriori (MAP), θ̂_MAP:

    Poisson:  λ̂_MLE = (# of events)/(# of intervals)
              λ̂_MAP = ((# of events) + (imag. events)) / ((# of intervals) + (imag. intervals))
    Normal:   μ̂_MLE = (1/n) ∑_{i=1}^{n} x_i,   σ̂²_MLE = (1/n) ∑_{i=1}^{n} (x_i − μ̂)²    (MAP not discussed)
    Uniform:  α̂_MLE = min(x_1, x_2, …, x_n),   β̂_MLE = max(x_1, x_2, …, x_n)            (MAP not discussed)

    Additional MLE caveat: it suffers from small samples.

  • Goals for today

    Machine Learning
    ◦ Introduction
    ◦ Linear Regression
    ◦ Brute-force Classification
    ◦ Naïve Bayes
    ◦ Evaluation tools: Precision, Recall

    "Started from the bottom, now we're here" – Drake, 2013
    (Course-arc figure: from Counting at the start of CS109 to Machine Learning, "here," at the end.)

  • Sampling use cases

    (Diagram relating a model of X to samples:)
    • Parameter estimation: from a sample of the population, estimate a model of X.
    • Bootstrapping: from a sample, get estimates of E[X] or Var(X), confidence intervals, p-values.
    • Machine learning: from a sample with features and labels, predict labels on new data.

  • What is Machine Learning?

    Many different forms, but we focus on the problem of prediction.
    Sample data is now:
    • Features X = <X_1, X_2, …, X_m>
    • Label Y
    Goal: learn a function Ŷ = g(X) to predict Y.
    1. Training: observe the relationship from pre-existing data.
    2. Prediction: feed a new X into the model (with parameters) g(X) to get Ŷ.

    Examples:
    • Known stock patterns: X: economic indicators for a stock; Y: the stock's price tomorrow.
      On a new, unseen stock: X: economic indicators for this stock; Ŷ: the stock's price tomorrow?   (Y continuous)
    • Known medical records: X: clinical/demographic data; Y: has cancer by age 40.
      On a new person: X: clinical/demographic data; Ŷ: cancer by age 40?   (Y discrete)
    • Past purchases on a credit card: X: purchase details; Y: fraud or not.
      A new purchase: X: purchase details; Ŷ: fraud?   (Y discrete)

  • Spam, Spam… Go Away!

    "And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam."

    https://www.google.com/mail/help/intl/en-GB/fightspam/spamexplained.html

  • What is Machine Learning? (continued)

    Sample data: features X = <X_1, X_2, …, X_m> and label Y.
    Goal: learn a function Ŷ = g(X) to predict Y.
    1. Training: observe the relationship from pre-existing data.
    2. Prediction:
       • Y discrete: prediction is called "classification"
       • Y continuous: prediction is called "regression"

    Training data (sample size N; each X is an m-dimensional IID RV of features, Y a 1-D dependent label):
      X^(1) = <X_1 = 1, X_2 = 3, …, X_m = 3.5>,   Y^(1) = 1
      X^(2) = <X_1 = 0, X_2 = 1, …, X_m = −1.9>,  Y^(2) = 0
      …
      X^(N) = <X_1 = 1, X_2 = 20, …, X_m = 1.2>,  Y^(N) = 0

  • How do we find g(X)?

    Mathematically:
    1. Determine a parametric form for Ŷ = g(X) with parameters θ.
       Train on data: (X^(1), Y^(1)), (X^(2), Y^(2)), …, (X^(N), Y^(N)).
    2. Estimate the best parameters θ according to an objective function:
       • Y continuous: minimize the mean squared error (MSE) "loss", E[(Y − g(X))²]
       • Y discrete: maximize the likelihood of Y given the features X:  g(X) = argmax_y P̂(Y = y | X)

    Prediction on test data: given a new input X, use g(X) to predict the "class" Ŷ.

  • Linear Regression (as a warmup)

    Observe: a single variable X.  Output: a real value Y (continuous output).

    Model: assume linear: Ŷ = g(X) = aX + b, with parameters θ = (a, b).
    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
    (superscripts denote the k-th training instance).

    Goal: Ŷ = aX + b, where (a, b) = argmin_{a,b} E[(Y − g(X))²]

    1. E[(Y − g(X))²] = E[(Y − (aX + b))²] = E[(Y − aX − b)²]
    2. Compute the derivatives w.r.t. a and b:
         ∂/∂a E[(Y − aX − b)²] = E[−2X(Y − aX − b)] = −2E[XY] + 2aE[X²] + 2bE[X]
         ∂/∂b E[(Y − aX − b)²] = E[−2(Y − aX − b)] = −2E[Y] + 2aE[X] + 2b

  • Linear Regression (continued)

    Goal: Ŷ = aX + b, where (a, b) = argmin_{a,b} E[(Y − g(X))²]

    3. Set the derivatives to 0 and solve the simultaneous equations:
         a = (E[XY] − E[X]E[Y]) / (E[X²] − E[X]²) = Cov(X, Y) / Var(X) = ρ(X, Y) σ_Y / σ_X
         b = E[Y] − a E[X] = μ_Y − ρ(X, Y) (σ_Y / σ_X) μ_X
    4. Substitution:
         g(X) = ρ(X, Y) (σ_Y / σ_X) (X − μ_X) + μ_Y
    5. Estimate the parameters based on the observed training data:
         Ŷ = g(X = x) = ρ̂(X, Y) (σ̂_Y / σ̂_X) (x − μ̂_X) + μ̂_Y

  • Classification

    Observe: discrete observations X = <X_1, X_2, …, X_m>.  Output: a discrete "class" Y.

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X)

    Note (via Bayes' theorem): Ŷ = argmax_y P̂(X, Y = y) = argmax_y P̂(X | Y = y) P̂(Y = y)

    Example: let the X_i and Y be indicator variables of liking a franchise. For X = [1, 1, 0]:
      P̂(X, Y = 1) ≈ 0.21
      P̂(X, Y = 0) ≈ 0.14
      argmax_y → the classifier predicts Ŷ = 1

  • Classification (continued)

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y)
    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).

    What is our best estimate of P̂(X, Y)?
    • Recall: given a sample of size n, p̂_MLE maximizes the likelihood of seeing that particular sample, where each X_i in the sample is IID.
    • In our training data: each (x^(i), y^(i)) is an RV in our sample, and the (x^(i), y^(i))'s are IID.
    • P(X, Y) can be modeled as a multinomial distribution!
      # of outcomes: the total # of probabilities in the P(X, Y) space.

    What is the p̂_MLE that maximizes our likelihood of seeing the training data?
    What is the p̂_MAP that maximizes p given the training data and an assumed prior?

  • Classification (continued)

    Training data: N training pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).
    • What is the p̂_MLE that maximizes our likelihood of seeing the training data?
        p̂_MLE = argmax_p L(p) = argmax_p ∏_{i=1}^{N} P(x^(i), y^(i) | p)        Model: (X, Y) is multinomial.
    • What is the p̂_MAP that maximizes p given the training data and a Laplace prior?
        p̂_MAP = argmax_p g(p) ∏_{i=1}^{N} P(x^(i), y^(i) | p)                   Model: (X, Y) is multinomial.
        Prior: Dirichlet with 1 imaginary trial for each possible outcome (x, y).

    Train:
    1. MLE: count the # of times each pair (x, y) appears in the training data.
    2. MAP using the Laplace prior: add 1 to all the MLE counts.
    3. Normalize to get a true distribution (sums to 1).

    Test/Predict:
    • Given a new data point X, predict Ŷ = argmax_y P̂(x, y).

  • Training for cats

    Observe: X in {1, 2, 3, 4}
    • X denotes the location of the photo: bedroom (1), bathroom (2), patio (3), fridge (4)
    Output: Y from {0, 1}
    • Y denotes whether there is a cat in the photo (1) or not (0)

    1. What are the MLE estimates of p̂_{X,Y}(x, y)?
    2. What are the Laplace estimates of p̂_{X,Y}(x, y)?

    Train data: 50 points (counts):
               X=1   X=2   X=3   X=4
      no cat     3     7    10    20
      cat        5     3     2     0

    p̂_MLE = (count in cell) / (total # data points)
    p̂_Laplace = (count in cell + 1) / (total # data points + total # cells)

    MLE estimate:
               X=1   X=2   X=3   X=4   | p_Y(y)
      no cat  0.06  0.14  0.20  0.40   | 0.80
      cat     0.10  0.06  0.04  0.00   | 0.20
      p_X(x)  0.16  0.20  0.24  0.40   | 1.00

    Laplace (MAP) estimate:
                X=1    X=2    X=3    X=4   | p_Y(y)
      no cat  0.069  0.138  0.190  0.362   | 0.759
      cat     0.103  0.069  0.052  0.017   | 0.241
      p_X(x)  0.172  0.207  0.242  0.379   | 1.00

    Ŷ = argmax_y P̂(x, y)
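
    The two probability tables above follow mechanically from the count table; a Python sketch (the `counts` array mirrors the slide's 50 training points):

```python
import numpy as np

# Rows: Y = 0 (no cat), Y = 1 (cat).  Columns: X = 1, 2, 3, 4.
counts = np.array([[3, 7, 10, 20],
                   [5, 3,  2,  0]], dtype=float)
n = counts.sum()                        # 50 data points
cells = counts.size                     # 8 cells

p_mle     = counts / n                  # count / n
p_laplace = (counts + 1) / (n + cells)  # (count + 1) / (n + # cells)

print(p_mle)       # the cat & fridge cell is exactly 0
print(p_laplace)   # every cell is now positive
```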

  • Testing for cats

    Observe: X in {1, 2, 3, 4}, the location of the photo: bedroom (1), bathroom (2), patio (3), fridge (4).
    Output: Y from {0, 1}: cat in the photo (1) or not (0).

    You are now shown a new photo of a fridge, so X = 4.
    1. What is your prediction Ŷ for whether there is a cat (or not) in the photo?
    2. Given X = 4, what is the probability of having a cat in the photo?

    Using Ŷ = argmax_y P̂(x, y) with the tables from the previous slide:

    MLE estimate:
    1. Predict: Ŷ = 0 (no cat)
    2. Absolutely, positively no chance of a cat (P̂(X = 4, Y = 1) = 0)

    Laplace (MAP) estimate:
    1. Predict: Ŷ = 0 (no cat)
    2. A small chance → "never say never"

  • The issue with multiple observables

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Predict: a discrete output Y with 2 outcomes.

    Model: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y)

    How large is your lookup table P̂(X, Y) if:
    1. m = 3?        2^3 · 2 = 16 entries
    2. m = 10?       2^10 · 2 = 2,048 entries
    3. m = 10,000?   2^10,000 · 2 = 2^10,001 entries

    • The size of the PMF table is exponential in m (e.g., O(2^m)).
    • We would need a ridiculous amount of data for good probability estimates.
    • We are likely to have many 0's in the table (bad times).
    • We need to consider a simpler model.

  • Break

    Attendance: tinyurl.com/cs109summer2018

  • Naïve Bayes Classifier

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Predict: a discrete output Y with 2 outcomes.

    Objective: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y) = argmax_y P̂(X | Y = y) P̂(Y = y)

    Naïve Bayes assumption:
    • Assume X_1, X_2, …, X_m are conditionally independent given Y.
    • You don't really believe this…
    • But it is an approximation we make to be able to make predictions.

    Therefore:
      Ŷ = argmax_y P̂(X | Y = y) P̂(Y = y) = argmax_y ∏_{i=1}^{m} P̂(X_i | Y = y) · P̂(Y = y)

    • Computation of the PMF tables is now linear in m: O(m).
    • We don't need as much data to get good probability estimates.

  • Naïve Bayes Classification

    Objective: Ŷ = g(X) = argmax_y P̂(Y = y | X) = argmax_y P̂(X, Y = y) = argmax_y ∏_{i=1}^{m} P̂(X_i | Y = y) · P̂(Y = y)

    Train (see the sketch after this slide):
    1. Compute P̂(X_i | Y = y) for i = 1, …, m:
       a. Bernoulli MLE (# times X_i = 0, Y = y vs. X_i = 1, Y = y)
       b. Laplace estimate: add 1 to the MLE counts
       c. Normalize to get a true distribution (sums to 1)
    2. Compute P̂(Y):
       a. Bernoulli MLE (# times Y = 0 vs. Y = 1)
       b. Normalize by the # of samples to get a true distribution (sums to 1)
       (Note: no Laplace smoothing for the class probabilities.)

    Classify a new X = <X_1 = x_1, X_2 = x_2, …, X_m = x_m>:
    1. Compute P̂(X, Y = y) = ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) · P̂(Y = y) for y = 0 and y = 1.
    2. Return the label ("class") that led to the higher probability.
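
    A compact sketch of this train/classify recipe for binary features. The function and variable names (train_nb, predict_nb) and the tiny training set are mine, not from the lecture; the estimators follow the slide (Laplace on P(X_i | Y), plain MLE on P(Y)):

```python
from collections import Counter

def train_nb(X, y):
    """X: list of m-length binary feature tuples; y: list of 0/1 labels."""
    n, m = len(X), len(X[0])
    class_count = Counter(y)
    # MLE for the class prior P(Y = c); no smoothing, as on the slide.
    p_y = {c: class_count[c] / n for c in (0, 1)}
    # Laplace estimate of P(X_i = 1 | Y = c): (count + 1) / (#(Y = c) + 2).
    p_xi = {c: [(sum(x[i] for x, lab in zip(X, y) if lab == c) + 1) / (class_count[c] + 2)
                for i in range(m)]
            for c in (0, 1)}
    return p_y, p_xi

def predict_nb(p_y, p_xi, x):
    """Return argmax_c P(Y = c) * prod_i P(X_i = x_i | Y = c)."""
    best_c, best_prob = None, -1.0
    for c in (0, 1):
        prob = p_y[c]
        for i, xi in enumerate(x):
            prob *= p_xi[c][i] if xi == 1 else 1.0 - p_xi[c][i]
        if prob > best_prob:
            best_c, best_prob = c, prob
    return best_c

# Tiny made-up training set: m = 2 binary features, n = 6 points.
X = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]
p_y, p_xi = train_nb(X, y)
print(predict_nb(p_y, p_xi, (1, 0)))   # predicted class for a new point
```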

  • Naïve Bayes + MLE for movies

    Observe: indicator variables X_1, X_2
    • X_1: "likes Star Wars"
    • X_2: "likes Harry Potter"
    Output: Y, an indicator variable
    • Y: "likes Pokemon"
    (X_1 = 1: likes Star Wars;  X_2 = 1: likes Harry Potter;  Y = 1: likes Pokemon)

  • Naïve Bayes + MLE for movies (training)

    Train:
    1. What are the MLE estimates of p̂_{X_i|Y}(x_i | y)?
    2. What are the MLE estimates of p̂_Y(y)?
    Classify/Predict/Test:
    3. A new person likes Star Wars (X_1 = 1) but not Harry Potter (X_2 = 0). Will they like Pokemon?

    Train data: n = 30 points (counts):
      X_1 vs Y:   Y=0: X_1=0 → 3,  X_1=1 → 10      Y=1: X_1=0 → 4,  X_1=1 → 13
      X_2 vs Y:   Y=0: X_2=0 → 5,  X_2=1 → 8       Y=1: X_2=0 → 7,  X_2=1 → 10
      Y:          Y=0 → 13,  Y=1 → 17

    Estimators:
      p̂_{X_1|Y}(x_1 | y) = #(X_1 = x_1, Y = y) / #(Y = y)
      p̂_{X_2|Y}(x_2 | y) = #(X_2 = x_2, Y = y) / #(Y = y)
      p̂_Y(y) = #(Y = y) / n

    MLE estimates:
      p̂_{X_1|Y}:  Y=0: [0.23, 0.77]   Y=1: [0.24, 0.76]      (columns: x_1 = 0, 1)
      p̂_{X_2|Y}:  Y=0: [0.38, 0.62]   Y=1: [0.41, 0.59]      (columns: x_2 = 0, 1)
      p̂_Y:        [0.43, 0.57]                                (y = 0, 1)

  • Naïve Bayes + MLE for movies (classification)

    A new person likes Star Wars (X_1 = 1) but not Harry Potter (X_2 = 0). Will they like Pokemon?

    Using the MLE estimates from the previous slide:

      P̂(X, Y = 0) = P̂(X_1 = 1 | Y = 0) P̂(X_2 = 0 | Y = 0) P̂(Y = 0)
                   = p̂_{X_1|Y}(1|0) · p̂_{X_2|Y}(0|0) · p̂_Y(0)
                   = 0.77 · 0.38 · 0.43 ≈ 0.13     (smaller)

      P̂(X, Y = 1) = P̂(X_1 = 1 | Y = 1) P̂(X_2 = 0 | Y = 1) P̂(Y = 1)
                   = p̂_{X_1|Y}(1|1) · p̂_{X_2|Y}(0|1) · p̂_Y(1)
                   = 0.76 · 0.41 · 0.57 ≈ 0.18     (larger)

    Ŷ = 1: predict "likes Pokemon"
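
    The same computation in a few lines of Python, using the MLE tables above (the numbers are copied from the slide):

```python
# MLE estimates from the training data (keys: y = 0, 1; inner keys: feature value).
p_x1_given_y = {0: {0: 0.23, 1: 0.77}, 1: {0: 0.24, 1: 0.76}}
p_x2_given_y = {0: {0: 0.38, 1: 0.62}, 1: {0: 0.41, 1: 0.59}}
p_y          = {0: 0.43, 1: 0.57}

# New person: likes Star Wars (x1 = 1), not Harry Potter (x2 = 0).
x1, x2 = 1, 0
scores = {y: p_x1_given_y[y][x1] * p_x2_given_y[y][x2] * p_y[y] for y in (0, 1)}
print(scores)                        # {0: ~0.13, 1: ~0.18}
print(max(scores, key=scores.get))   # 1: predict "likes Pokemon"
```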

  • Naïve Bayes + Laplace for movies

    Train:
    1. What are the Laplace estimates of p̂_{X_i|Y}(x_i | y)?
    2. What are the MLE estimates of p̂_Y(y)?
       (Note: Laplace smoothing is not used for the prior p̂_Y.)

    Train data: n = 30 points (counts), as before:
      X_1 vs Y:   Y=0: X_1=0 → 3,  X_1=1 → 10      Y=1: X_1=0 → 4,  X_1=1 → 13
      X_2 vs Y:   Y=0: X_2=0 → 5,  X_2=1 → 8       Y=1: X_2=0 → 7,  X_2=1 → 10
      Y:          Y=0 → 13,  Y=1 → 17

    Estimators:
      p̂_{X_1|Y}(x_1 | y) = (#(X_1 = x_1, Y = y) + 1) / (#(Y = y) + 2)
      p̂_{X_2|Y}(x_2 | y) = (#(X_2 = x_2, Y = y) + 1) / (#(Y = y) + 2)
      p̂_Y(y) = #(Y = y) / n

    Laplace estimates:
      p̂_{X_1|Y}:  Y=0: [0.27, 0.73]   Y=1: [0.26, 0.74]      (columns: x_1 = 0, 1)
      p̂_{X_2|Y}:  Y=0: [0.40, 0.60]   Y=1: [0.42, 0.58]      (columns: x_2 = 0, 1)
      p̂_Y:        [0.43, 0.57]                                (y = 0, 1)

  • Summary so far

    Observe: a discrete variable X = <X_1, X_2, …, X_m>, each with 2 outcomes.
    Output: a discrete output Y with 2 outcomes.

    Classification:
      Ŷ = g(X) = argmax_y P̂(Y = y | X)              (by definition)
               = argmax_y P̂(X, Y = y)               (equivalent)
               = argmax_y P̂(X | Y = y) P̂(Y = y)     (equivalent)

    Naïve Bayes assumption:
    • Assume the X_i's are conditionally independent given Y.
    • This reduces the probabilities we need to compute from O(2^m) to O(m).
    • Training: use the MLE or Laplace estimate for P̂(X_i | Y), and the MLE estimate for P̂(Y).
    • Testing: compute
        Ŷ = argmax_y ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) · P̂(Y = y)

  • Log-probabilities

    What if m is large and each P̂(x_i | y) is small?

      Ŷ = argmax_y P̂(Y = y | X) = argmax_y P̂(Y = y) ∏_{i=1}^{m} P̂(X_i = x_i | Y = y)      Underflow!

    Solution: take logs.

      Ŷ = argmax_y P̂(Y = y | X) = argmax_y log P̂(Y = y | X)
        = argmax_y log [ P̂(Y = y) ∏_{i=1}^{m} P̂(X_i = x_i | Y = y) ]
        = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

    This works well for high-dimensional data.

  • Bags of words as a multinomial

    Example document:
    "Pay for Viagra with a credit card. Viagra is great. So are credit-cards. Risk-free Viagra. Click for free."

    Word counts (n = 18):
      Viagra = 2, Free = 2, Risk = 1, Credit-card = 2, …, For = 2

    Probability of (this document | spam writer) is multinomial:

      P(doc | spam) = n! / (2! 2! … 2!) · p²_viagra · p²_free · … · p²_for

    Note: P(viagra | spam writer) >> P(viagra | writer = you).

  • Bag of Words with Naïve Bayes

    Goal: predict whether an email is spam or not.
    Known: a lexicon of m words (e.g., for the English language, m ≈ 100,000).

    Observe: m indicator variables X = <X_1, X_2, …, X_m>, where X_i denotes whether word i appeared in the email or not.
    Output: Y, an indicator variable denoting whether the email is spam or not.

    Preprocess: N previous emails with the following info per email:
    • X = <X_1, X_2, …, X_m>, noting for each word whether it appeared in this email
    • Y, a label marking this email as spam or not

      Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

    (The features are now indicators, no longer counts.)

  • Spam Classifier

    Training:
    1. Go through all training emails and update all counts:
         #(X_i = 1, Y = spam),  #(X_i = 1, Y = not spam)
         #(X_i = 0, Y = spam),  #(X_i = 0, Y = not spam)
       Sanity check: #(X_i = 1, Y = spam) + #(X_i = 0, Y = spam) = #(Y = spam), and so on.
    2. Since many words (in the lexicon of 100,000 words) are likely to not appear at all in the training emails, use the Laplace estimate:
         P̂(X_i = x_i | Y = y) = (#(X_i = x_i, Y = y) + 1) / (#(Y = y) + 2)

    Classifying:
    • Classify a new email:
        Ŷ = argmax_y P̂(Y = y | X)
    • Employ the Naïve Bayes assumption:
        Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]

  • Checking performance

    After training, we can test with another set of data.
    Test data: also has features X = <X_1, X_2, …, X_m> and known Y values.

    We can then see how often we were right/wrong in our predictions for Y. Metrics:
    • Precision: (# correctly predicted as class Y) / (# predicted as class Y)
    • Recall: (# correctly predicted as class Y) / (# real class Y messages)
    • Accuracy: (# correct across all classes) / (total # test emails)

    Spam data:
    • Email dataset: 1789 emails (1578 spam, 211 non-spam)
    • Randomly choose 1538 emails for training
    • Remaining 251 emails for testing the learned classifier

    Results (words-only features):
                   Spam                     Non-spam
                   Precision   Recall       Precision   Recall
      Words only   97.1%       94.3%        87.7%       93.4%
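
    A minimal sketch of these three metrics; the toy predicted/true labels below are invented, not the email dataset from the slide:

```python
def precision_recall(y_true, y_pred, cls):
    pred_cls   = sum(1 for p in y_pred if p == cls)               # predicted as cls
    actual_cls = sum(1 for t in y_true if t == cls)               # really cls
    correct    = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
    precision  = correct / pred_cls if pred_cls else 0.0
    recall     = correct / actual_cls if actual_cls else 0.0
    return precision, recall

# Hypothetical test labels: 1 = spam, 0 = non-spam.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print("spam:    ", precision_recall(y_true, y_pred, 1))   # (0.8, 0.8)
print("non-spam:", precision_recall(y_true, y_pred, 0))   # (0.667, 0.667)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)                               # 0.75
```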

  • Congratulations!

    You've just learned Machine Learning in a nutshell.

    1. Have some training data: features X = <X_1, X_2, …, X_m> and labels Y.
    2. Assume some relationship between the features and the labels.
    3. Get estimates of the relationship from the training data: p̂_{X_i|Y}(x_i | y) and p̂_Y(y).
    4. Use the estimates to predict on test data using the assumptions:
         Ŷ = argmax_y [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]