CS 188: Artificial Intelligence
Naïve Bayes

Instructors: Brijen Thananjeyan and Aditya Baradwaj
University of California, Berkeley

[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]


Page 1

CS 188: Artificial Intelligence: Naïve Bayes
Instructors: Brijen Thananjeyan and Aditya Baradwaj, University of California, Berkeley

Page 2

Machine Learning

§ Up until now: how to use a model to make optimal decisions

§ Machine learning: how to acquire a model from data/experience
§ Learning parameters (e.g. probabilities)
§ Learning structure (e.g. BN graphs)
§ Learning hidden concepts (e.g. clustering)

§ Today: model-based classification with Naive Bayes

Page 3

Classification

Page 4

Example: Spam Filter

§ Input: an email
§ Output: spam/ham

§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham"
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future emails

§ Features: The attributes used to make the ham/spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: SenderInContacts
§ …

Dear Sir.

First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.

99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Page 5

Example: Digit Recognition

§ Input: images / pixel grids
§ Output: a digit 0-9

§ Setup:
§ Get a large collection of example images, each labeled with a digit
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future digit images

§ Features: The attributes used to make the digit decision
§ Pixels: (6,8) = ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …

[Example images labeled 0, 1, 2, 1, and one unlabeled: ??]

Page 6

Other Classification Tasks

§ Classification: given inputs x, predict labels (classes) y

§ Examples:
§ Spam detection (input: document, classes: spam/ham)
§ OCR (input: images, classes: characters)
§ Medical diagnosis (input: symptoms, classes: diseases)
§ Automatic essay grading (input: document, classes: grades)
§ Fraud detection (input: account activity, classes: fraud/no fraud)
§ Customer service email routing
§ … many more

§ Classification is an important commercial technology!

Page 7

Model-Based Classification

Page 8

Model-Based Classification

§ Model-based approach
§ Build a model (e.g. Bayes' net) where both the label and features are random variables
§ Instantiate any observed features
§ Query for the distribution of the label conditioned on the features

§ Challenges
§ What structure should the BN have?
§ How should we learn its parameters?

Page 9

Naïve Bayes for Digits

§ Naïve Bayes: Assume all features are independent effects of the label

§ Simple digit recognition version:
§ One feature (variable) Fij for each grid position <i,j>
§ Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
§ Each input maps to a feature vector, e.g.

§ Here: lots of features, each is binary valued

§ Naïve Bayes model:

  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)

§ What do we need to learn?

[Diagram: Bayes' net with label Y as parent of features F1, F2, …, Fn]
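The feature map described above can be sketched as follows (a minimal illustration; it assumes the image arrives as rows of grayscale intensities in [0, 1], and the 2x2 "image" below is made up):

```python
# Feature F_ij is 1 ("on") when pixel (i, j) has intensity above 0.5, else 0 ("off").
def pixel_features(image):
    return {(i, j): int(intensity > 0.5)
            for i, row in enumerate(image)
            for j, intensity in enumerate(row)}

# Toy 2x2 "image" of grayscale intensities:
features = pixel_features([[0.9, 0.2],
                           [0.4, 0.7]])
# features == {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```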

Page 10

General Naïve Bayes

§ A general Naive Bayes model:

  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)

§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway

[Diagram: Bayes' net with label Y as parent of features F1, F2, …, Fn. P(Y) takes |Y| parameters and the tables P(Fi|Y) take n x |F| x |Y| parameters, versus |Y| x |F|^n values for the full joint.]
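The parameter-count claim can be checked with rough sizes for the digit example (10 classes and, as an assumption here, a 16x16 grid of 256 binary pixel features):

```python
# Hypothetical sizes: 10 classes, 256 binary features.
n_classes = 10      # |Y|
n_features = 256    # n
n_values = 2        # |F|: each pixel is on/off

# Naive Bayes: P(Y) plus one table P(Fi|Y) per feature -> linear in n.
naive_bayes_params = n_classes + n_features * n_values * n_classes

# The full joint over (Y, F1..Fn) would need |Y| * |F|^n values.
full_joint_values = n_classes * n_values ** n_features

print(naive_bayes_params)   # 5130
print(full_joint_values)    # 10 * 2**256: astronomically large
```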

Page 11

Inference for Naïve Bayes

§ Goal: compute posterior distribution over label variable Y

§ Step 1: get joint probability of label and evidence for each label

  P(y, f1 … fn) = P(y) ∏i P(fi | y)

§ Step 2: sum to get probability of evidence

  P(f1 … fn) = Σy' P(y', f1 … fn)

§ Step 3: normalize by dividing Step 1 by Step 2

  P(y | f1 … fn) = P(y, f1 … fn) / P(f1 … fn)
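The three steps can be sketched as follows (a minimal illustration with made-up numbers; the model is assumed to be given as a prior dict P(y) and per-feature conditionals P(Fi = 1 | y)):

```python
prior = {"spam": 0.33, "ham": 0.67}              # P(Y) (illustrative)
cond = {"spam": [0.8, 0.1], "ham": [0.2, 0.4]}   # P(F_i = 1 | Y) for two binary features

def posterior(features, prior, cond):
    # Step 1: joint probability P(y, f1..fn) = P(y) * prod_i P(f_i | y)
    joint = {}
    for y in prior:
        p = prior[y]
        for i, f in enumerate(features):
            p_on = cond[y][i]
            p *= p_on if f == 1 else (1 - p_on)
        joint[y] = p
    # Step 2: probability of evidence P(f1..fn) = sum over labels
    evidence = sum(joint.values())
    # Step 3: normalize
    return {y: joint[y] / evidence for y in joint}

post = posterior([1, 0], prior, cond)   # posterior sums to 1 by construction
```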

Page 12

General Naïve Bayes

§ What do we need in order to use Naïve Bayes?

§ Inference method (we just saw this part)
§ Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
§ Use standard inference to compute P(Y|F1…Fn)
§ Nothing new here

§ Estimates of local conditional probability tables
§ P(Y), the prior over labels
§ P(Fi|Y) for each feature (evidence variable)
§ These probabilities are collectively called the parameters of the model and denoted by θ

§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon

Page 13

Example: Conditional Probabilities

P(Y), a uniform prior over digits:

y      1    2    3    4    5    6    7    8    9    0
P(y)   0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F = on | Y) for one example pixel feature:

y          1     2     3     4     5     6     7     8     9     0
P(on|y)    0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F = on | Y) for a second example pixel feature:

y          1     2     3     4     5     6     7     8     9     0
P(on|y)    0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80

Page 14

A Spam Filter

§ Naïve Bayes spam filter

§ Data:
§ Collection of emails, labeled spam or ham
§ Note: someone has to hand label all this data!
§ Split into training, held-out, test sets

§ Classifiers
§ Learn on the training set
§ (Tune it on a held-out set)
§ Test it on new emails

Dear Sir.

First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.

99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Page 15

Naïve Bayes for Text

§ Bag-of-words Naïve Bayes:
§ Features: Wi is the word at position i
§ As before: predict label conditioned on feature variables (spam vs. ham)
§ As before: assume features are conditionally independent given label
§ New: each Wi is identically distributed

§ Generative model:

  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)

§ "Tied" distributions and bag-of-words
§ Usually, each variable gets its own conditional probability distribution P(F|Y)
§ In a bag-of-words model
§ Each position is identically distributed
§ All positions share the same conditional probs P(W|Y)
§ Why make this assumption?

§ Called "bag-of-words" because model is insensitive to word order or reordering

Note: Wi is the word at position i, not the ith word in the dictionary!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

in is lecture lecture next over person remember room sitting the the the to to up wake when you

How many variables are there? How many values?
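The bag-of-words idea on this slide can be sketched as follows (a minimal tokenizer; real systems handle punctuation, case, and vocabulary more carefully):

```python
# A document reduces to the multiset of its words; here, as on the slide,
# we show it as a sorted list, which discards word order entirely.
def bag_of_words(text):
    words = text.lower().replace(",", "").replace(".", "").split()
    return sorted(words)

sentence = ("When the lecture is over, remember to wake up the "
            "person sitting next to you in the lecture room.")
bag = bag_of_words(sentence)

# Reordering the input leaves the bag unchanged:
shuffled = ("the lecture When over, is remember to wake up the "
            "person sitting next to you in the lecture room.")
assert bag == bag_of_words(shuffled)
```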

Page 16

Example: Spam Filtering

§ Model:

  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)

§ What are the parameters?

§ Where do these tables come from?

P(Y):
ham : 0.66
spam: 0.33

P(W | Y = ham):
the : 0.0156
to  : 0.0153
and : 0.0115
of  : 0.0095
you : 0.0093
a   : 0.0086
with: 0.0080
from: 0.0075
...

P(W | Y = spam):
the : 0.0210
to  : 0.0133
of  : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a   : 0.0100
...

Page 17

Spam Example

Word      P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)   0.33333    0.66666   -1.1      -0.4
Gary      0.00002    0.00021   -11.8     -8.9
would     0.00069    0.00084   -19.1     -16.0
you       0.00881    0.00304   -23.8     -21.8
like      0.00086    0.00083   -30.9     -28.9
to        0.01517    0.01339   -35.1     -33.2
lose      0.00008    0.00002   -44.5     -44.0
weight    0.00016    0.00002   -53.3     -55.0
while     0.00027    0.00027   -61.5     -63.2
you       0.00881    0.00304   -66.2     -69.0
sleep     0.00006    0.00001   -76.0     -80.5

(The "Tot" columns are running totals of log probabilities.)

P(spam | w) = 98.9%
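The running totals can be reproduced from the table's probabilities, assuming they are cumulative natural-log probabilities log P(y) + Σi log P(wi | y) (a sketch; the slide's printed numbers are rounded):

```python
import math

words = ["Gary", "would", "you", "like", "to", "lose",
         "weight", "while", "you", "sleep"]
p_spam = {"Gary": 0.00002, "would": 0.00069, "you": 0.00881, "like": 0.00086,
          "to": 0.01517, "lose": 0.00008, "weight": 0.00016, "while": 0.00027,
          "sleep": 0.00006}
p_ham = {"Gary": 0.00021, "would": 0.00084, "you": 0.00304, "like": 0.00083,
         "to": 0.01339, "lose": 0.00002, "weight": 0.00002, "while": 0.00027,
         "sleep": 0.00001}

# Accumulate log probabilities, starting from the log prior.
log_spam = math.log(0.33333) + sum(math.log(p_spam[w]) for w in words)
log_ham = math.log(0.66666) + sum(math.log(p_ham[w]) for w in words)

# Normalize in log space to recover the posterior; this lands close to
# the slide's 98.9%.
posterior_spam = 1 / (1 + math.exp(log_ham - log_spam))
```

Working in log space is essential here: the raw joint probabilities are around e^-76 and would underflow long before the products finish.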

Page 18

Training and Testing

Page 19

Important Concepts

§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held out set
§ Test set

§ Features: attribute-value pairs which characterize each x

§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on test set
§ Very important: never "peek" at the test set!

§ Evaluation
§ Accuracy: fraction of instances predicted correctly

§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly

[Figure: data split into Training Data, Held-Out Data, and Test Data]

Page 20

Underfitting and Overfitting

Page 21

Overfitting

[Plot: a degree-15 polynomial fit to a handful of data points, oscillating wildly between them]

Page 22

Example: Overfitting

2 wins!!

Page 23

Example: Overfitting

§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham) / P(W|spam):
south-west : inf
nation : inf
morally : inf
nicely : inf
extent : inf
seriously : inf
...

P(W|spam) / P(W|ham):
screens : inf
minute : inf
guaranteed : inf
$205.00 : inf
delivery : inf
signature : inf
...

What went wrong here?

Page 24

Generalization and Overfitting

§ Relative frequency parameters will overfit the training data!
§ Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
§ Unlikely that every occurrence of "minute" is 100% spam
§ Unlikely that every occurrence of "seriously" is 100% ham
§ What about all the words that don't occur in the training set at all?
§ In general, we can't go around giving unseen events zero probability

§ As an extreme case, imagine using the entire email as the only feature
§ Would get the training data perfect (if deterministic labeling)
§ Wouldn't generalize at all
§ Just making the bag-of-words assumption gives us some generalization, but isn't enough

§ To generalize better: we need to smooth or regularize the estimates

Page 25

Parameter Estimation

Page 26

Parameter Estimation

§ Estimating the distribution of a random variable

§ Elicitation: ask a human (why is this hard?)

§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value:

  P_ML(x) = count(x) / total samples

§ This is the estimate that maximizes the likelihood of the data

[Figure: repeated draws of red (r) and blue (b) balls; e.g. the sample r r b gives P_ML(r) = 2/3, P_ML(b) = 1/3]

Page 27

Your First Consulting Job

§ A billionaire tech entrepreneur asks you a question:
§ He says: I have a thumbtack, if I flip it, what's the probability it will fall with the nail up?

§ You say: Please flip it a few times:

§ You say: The probability is:
§ P(H) = 3/5

§ He says: Why???
§ You say: Because…

Page 28

Your First Consulting Job

§ P(Heads) = θ, P(Tails) = 1 - θ

§ Flips are i.i.d.:
§ Independent events
§ Identically distributed according to unknown distribution

§ Sequence D of αH Heads and αT Tails

  D = {xi | i = 1 … n},  P(D | θ) = ∏i P(xi | θ)

Page 29

Maximum Likelihood Estimation

§ Data: Observed set D of αH Heads and αT Tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
§ What's the objective function?

  P(D | θ) = θ^αH (1 - θ)^αT

§ MLE: Choose θ to maximize probability of D

  θ̂ = argmax_θ P(D | θ)

Page 30

Maximum Likelihood Estimation

§ Set derivative to zero, and solve!

  θ̂ = argmax_θ ln P(D | θ)

  d/dθ ln P(D | θ) = d/dθ ln [θ^αH (1 - θ)^αT]
                   = d/dθ [αH ln θ + αT ln(1 - θ)]
                   = αH d/dθ ln θ + αT d/dθ ln(1 - θ)
                   = αH/θ - αT/(1 - θ) = 0

  ⇒ θ̂ = αH / (αH + αT)
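The closed form θ̂ = αH / (αH + αT) can be checked numerically against a grid search, using the thumbtack example's 3 heads and 2 tails:

```python
import math

# Log-likelihood of a Bernoulli parameter theta given head/tail counts.
def log_likelihood(theta, heads, tails):
    return heads * math.log(theta) + tails * math.log(1 - theta)

heads, tails = 3, 2
mle = heads / (heads + tails)   # closed form: 3/5 = 0.6, matching P(H) = 3/5

# The closed form should beat every other candidate theta on a fine grid.
best_grid = max((i / 1000 for i in range(1, 1000)),
                key=lambda t: log_likelihood(t, heads, tails))
assert abs(best_grid - mle) < 1e-9
```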

Page 31

Smoothing

Page 32

Maximum Likelihood?

§ Relative frequencies are the maximum likelihood estimates

  θ_ML = argmax_θ P(D | θ)  →  P_ML(x) = count(x) / total samples

§ Another option is to consider the most likely parameter value given the data

  θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D) = argmax_θ P(D | θ) P(θ)

Page 33

Unseen Events

Page 34

Laplace Smoothing

§ Laplace's estimate:
§ Pretend you saw every outcome once more than you actually did

  P_LAP(x) = (count(x) + 1) / (N + |X|)

§ Can derive this estimate with Dirichlet priors (see cs281a)

[Example: the sample r r b gives P_ML = (2/3, 1/3) but P_LAP = (3/5, 2/5)]

Page 35

Laplace Smoothing

§ Laplace's estimate (extended):
§ Pretend you saw every outcome k extra times

  P_LAP,k(x) = (count(x) + k) / (N + k|X|)

§ What's Laplace with k = 0?
§ k is the strength of the prior

§ Laplace for conditionals:
§ Smooth each condition independently:

  P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

[Example: the sample r r b]
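Laplace smoothing with strength k can be sketched as follows, using the red/blue sample r r b from the slides:

```python
from collections import Counter

# P_LAP,k(x) = (count(x) + k) / (N + k * |X|)
def laplace(samples, outcomes, k=1):
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

samples = ["r", "r", "b"]
mle_est = laplace(samples, ["r", "b"], k=0)      # k = 0 recovers raw MLE: r -> 2/3, b -> 1/3
lap_est = laplace(samples, ["r", "b"], k=1)      # Laplace's estimate: r -> 3/5, b -> 2/5
heavy = laplace(samples, ["r", "b"], k=100)      # large k pulls toward uniform
```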

Page 36

Real NB: Smoothing

§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham) / P(W|spam):
helvetica : 11.4
seems : 10.8
group : 10.2
ago : 8.4
areas : 8.3
...

P(W|spam) / P(W|ham):
verdana : 28.8
Credit : 28.4
ORDER : 27.2
<FONT> : 26.9
money : 26.5
...

Do these make more sense?

Page 37

Tuning

Page 38

Tuning on Held-Out Data

§ Now we've got two kinds of unknowns
§ Parameters: the probabilities P(X|Y), P(Y)
§ Hyperparameters: e.g. the amount/type of smoothing to do, k, α

§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train and test on the held-out data
§ Choose the best value and do a final test on the test data
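The tuning procedure can be sketched as follows (`fit` and `accuracy` are hypothetical stand-ins for training and scoring a classifier; the toy versions below only illustrate the control flow):

```python
def tune(train_data, held_out_data, test_data, k_values, fit, accuracy):
    # Pick the k that scores best on held-out data; the test set stays unseen.
    best_k = max(k_values,
                 key=lambda k: accuracy(fit(train_data, k), held_out_data))
    # One final, one-time evaluation on the test set.
    return best_k, accuracy(fit(train_data, best_k), test_data)

# Toy stand-ins: the "model" is just k, and held-out accuracy peaks at k = 2.
fit = lambda data, k: k
accuracy = lambda model, data: -(model - 2) ** 2
best_k, test_acc = tune(None, None, None, [0, 1, 2, 5, 10], fit, accuracy)
```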

Page 39

Summary

§ Bayes rule lets us do diagnostic queries with causal probabilities

§ The naïve Bayes assumption takes all features to be independent given the class label

§ We can build classifiers out of a naïve Bayes model using training data

§ Smoothing estimates is important in real systems