Density estimation

Page 1: Density estimation

Page 2: Density Estimation

• A Density Estimator learns a mapping from a set of attributes to a Probability

Input data for a variable or a set of variables → Density Estimator → Probability

Page 3: Density estimation

• Estimate the distribution (or conditional distribution) of a random variable
• Types of variables:
  - Binary: coin flip, alarm
  - Discrete: dice, car model year
  - Continuous: height, weight, temperature

Page 4: When do we need to estimate densities?

• Density estimators are critical ingredients in several of the ML algorithms we will discuss
• In some cases they are combined with other inference steps in more involved algorithms (e.g., EM), while in others they are part of a more general learning process (learning in BNs and HMMs)

Page 5: Density estimation

• Binary and discrete variables: easy, just count!
• Continuous variables: harder (but just a bit), fit a model (a short sketch follows below)
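As an illustration of "fit a model" (not from the slides), here is a minimal sketch that fits a one-dimensional Gaussian by maximum likelihood, using the sample mean and variance; the `heights` values are made up.

```python
import numpy as np

# Hypothetical sample of a continuous variable (e.g., heights in cm).
heights = np.array([162.0, 175.5, 168.2, 181.0, 159.4, 172.3, 177.8])

# Maximum-likelihood estimates for a Gaussian: sample mean and (biased) variance.
mu_hat = heights.mean()
sigma2_hat = heights.var()          # ML estimate divides by n, not n - 1

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(f"mu_hat = {mu_hat:.2f}, sigma2_hat = {sigma2_hat:.2f}")
print(f"estimated density at 170 cm: {gaussian_pdf(170.0, mu_hat, sigma2_hat):.4f}")
```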

Page 6: Learning a density estimator for discrete variables

P̂(x_i = u) = (# records in which x_i = u) / (total number of records)

A trivial learning algorithm!

But why is this true?
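A minimal sketch of the counting estimator above; the records are made-up values of a discrete variable.

```python
from collections import Counter

# Hypothetical records of a discrete variable (e.g., car model year).
records = [2012, 2015, 2012, 2019, 2015, 2012, 2020]

counts = Counter(records)
n = len(records)

# P_hat(x = u) = (# records in which x = u) / (total number of records)
p_hat = {u: c / n for u, c in counts.items()}
print(p_hat)   # e.g., {2012: 0.428..., 2015: 0.285..., 2019: 0.142..., 2020: 0.142...}
```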

Page 7: Maximum Likelihood Principle

• M is our model (usually a collection of parameters)

• We can define the likelihood of the data given the model as follows:

P̂(dataset | M) = P̂(x_1 ∧ x_2 ∧ … ∧ x_n | M) = ∏_{k=1}^{n} P̂(x_k | M)

• For example, M is:
  - The probability of 'head' for a coin flip
  - The probabilities of observing 1, 2, 3, 4, and 5 for a die (the sixth probability is determined by the others)
  - etc.

Page 8: Maximum Likelihood Principle

• Our goal is to determine the values for the parameters in M
• We can do this by maximizing the probability of generating the observed samples
• For example, let q be the probability of heads for a coin flip
• Then

L(x_1, …, x_n | q) = p(x_1 | q) ⋯ p(x_n | q)

• The observations (different flips) are assumed to be independent
• For such a coin flip with P(H) = q, the best assignment is q̂ = #H / #samples
• Why?

P̂(dataset | M) = P̂(x_1 ∧ x_2 ∧ … ∧ x_n | M) = ∏_{k=1}^{n} P̂(x_k | M)

Page 9: Maximum Likelihood Principle: Binary variables

• For a binary random variable A with P(A=1) = q, the MLE is q̂ = #1 / #samples
• Why?

Data likelihood (n_1 ones and n_2 zeros):

P(D | M) = q^{n_1} (1 − q)^{n_2}

We would like to find:

q̂ = argmax_q  q^{n_1} (1 − q)^{n_2}

(omitting terms that do not depend on q)

Page 10: Maximum Likelihood Principle

Data likelihood: P(D | M) = q^{n_1} (1 − q)^{n_2}

We would like to find: q̂ = argmax_q  q^{n_1} (1 − q)^{n_2}

Setting the derivative with respect to q to zero:

d/dq [ q^{n_1} (1 − q)^{n_2} ] = n_1 q^{n_1 − 1} (1 − q)^{n_2} − n_2 q^{n_1} (1 − q)^{n_2 − 1} = 0

⇒ q^{n_1 − 1} (1 − q)^{n_2 − 1} [ n_1 (1 − q) − n_2 q ] = 0

⇒ n_1 (1 − q) − n_2 q = 0

⇒ n_1 − (n_1 + n_2) q = 0

⇒ q̂ = n_1 / (n_1 + n_2)
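As a quick numerical sanity check (not on the slides), this sketch evaluates the likelihood q^{n_1}(1 − q)^{n_2} on a grid and confirms the maximizer is n_1 / (n_1 + n_2):

```python
import numpy as np

n1, n2 = 3, 2                                  # e.g., 3 heads and 2 tails
q_grid = np.linspace(1e-6, 1 - 1e-6, 10001)    # grid of candidate values for q
likelihood = q_grid ** n1 * (1 - q_grid) ** n2

q_best = q_grid[np.argmax(likelihood)]
print(q_best, n1 / (n1 + n2))                  # both close to 0.6
```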

Page 11: Recall: Your first consulting job

• A billionaire from the suburbs of Seattle asks you a question:
  - He says: I have a coin; if I flip it, what's the probability it will land heads up?
  - You say: Please flip it a few times (you observe 3 heads out of 5 flips)
  - You say: The probability is 3/5, because that is the frequency of heads in all flips
  - He says: But can I put money on this estimate?
  - You say: Ummm… maybe not. Not enough flips (fewer than the sample complexity)

Page 12: What about prior knowledge?

• Billionaire says: Wait, I know that the coin is "close" to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single q, we obtain a distribution over possible values of q

[Figure: distribution over q, peaked near 50-50 before data, updated after data]

Page 13: Bayesian Learning

• Use Bayes rule:

P(q | D) = P(D | q) P(q) / P(D)

• Or equivalently:

P(q | D) ∝ P(D | q) P(q)    (posterior ∝ likelihood × prior)

Page 14: Prior distribution

• From where do we get the prior?
  - Represents expert knowledge (philosophical approach)
  - Simple posterior form (engineer's approach)
• Uninformative priors:
  - Uniform distribution
• Conjugate priors:
  - Closed-form representation of posterior
  - P(q) and P(q | D) have the same algebraic form as a function of q

Page 15: Conjugate Prior

• P(q) and P(q | D) have the same form as a function of q

Eg. 1: Coin flip problem

Likelihood given Bernoulli model (α_H heads, α_T tails):

P(D | q) = q^{α_H} (1 − q)^{α_T}

If the prior is a Beta distribution, P(q) ∝ q^{β_H − 1} (1 − q)^{β_T − 1}, i.e. Beta(β_H, β_T),

then the posterior is a Beta distribution:

P(q | D) ∝ q^{β_H + α_H − 1} (1 − q)^{β_T + α_T − 1}, i.e. Beta(β_H + α_H, β_T + α_T)

For the Binomial, the conjugate prior is the Beta distribution.

Page 16: Beta distribution

• More concentrated as the values of β_H, β_T increase

Page 17: Beta conjugate prior

• As n = α_H + α_T increases (as we get more samples), the effect of the prior is "washed out"
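A small sketch of the conjugate update, assuming an illustrative Beta(β_H, β_T) prior and coin-flip counts α_H, α_T; it also shows the wash-out effect as the counts grow:

```python
def beta_posterior(beta_h, beta_t, alpha_h, alpha_t):
    """Conjugate update: Beta prior + Bernoulli counts -> Beta posterior."""
    a, b = beta_h + alpha_h, beta_t + alpha_t
    mean = a / (a + b)                  # posterior mean of q
    return a, b, mean

beta_h, beta_t = 50, 50                 # prior roughly encoding "close to 50-50"

for alpha_h, alpha_t in [(3, 2), (30, 20), (3000, 2000)]:
    a, b, mean = beta_posterior(beta_h, beta_t, alpha_h, alpha_t)
    mle = alpha_h / (alpha_h + alpha_t)
    print(f"posterior Beta({a}, {b}), mean = {mean:.3f}, MLE = {mle:.3f}")
# With more data the posterior mean moves from ~0.5 toward the MLE of 0.6.
```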

Page 18: Conjugate Prior

• The posterior p(θ | x) is in the same probability distribution family as the prior probability distribution p(θ)

Page 19: Conjugate Prior

• P(θ) and P(θ | D) have the same form

Eg. 2: Dice roll problem (6 outcomes instead of 2)

Likelihood is ~ Multinomial(θ = {θ_1, θ_2, …, θ_k})

If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution.

For the Multinomial, the conjugate prior is the Dirichlet distribution.

Page 20: Posterior Distribution

• The approach seen so far is what is known as a Bayesian approach
• Prior information is encoded as a distribution over possible values of the parameter
• Using Bayes rule, you get an updated posterior distribution over parameters, which you provide with a flourish to the billionaire
• But the billionaire is not impressed
  - Distribution? I just asked for one number: is it 3/5, 1/2, what is it?
  - How do we go from a distribution over parameters to a single estimate of the true parameters?

Page 21: Maximum A Posteriori Estimation

• Choose the q that maximizes the posterior probability: q̂_MAP = argmax_q P(q | D)

• MAP estimate of the probability of heads (the mode of the Beta posterior):

q̂_MAP = (β_H + α_H − 1) / (β_H + α_H + β_T + α_T − 2)
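A short sketch (not from the slides) computing this mode for illustrative prior pseudo-counts and observed flips:

```python
def map_estimate(beta_h, beta_t, alpha_h, alpha_t):
    """Mode of the Beta(beta_h + alpha_h, beta_t + alpha_t) posterior."""
    return (beta_h + alpha_h - 1) / (beta_h + alpha_h + beta_t + alpha_t - 2)

beta_h, beta_t = 50, 50        # prior pseudo-counts ("close to 50-50")
alpha_h, alpha_t = 3, 2        # observed heads and tails

print("MAP:", map_estimate(beta_h, beta_t, alpha_h, alpha_t))   # ~0.505, pulled toward 0.5
print("MLE:", alpha_h / (alpha_h + alpha_t))                    # 0.6
```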

Page 22: Naïve Bayes Classifier

• Bayes classifier with an additional "naïve" assumption:
  - Features are independent given the class:

    P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)

  - More generally, for X = [X_1, X_2, …, X_d]^T:

    P(X_1, …, X_d | Y) = ∏_{i=1}^{d} P(X_i | Y)

• If the conditional independence assumption holds, NB is the optimal classifier! But it is worse otherwise.

Page 23: Conditional Independence

• X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z:

P(X | Y, Z) = P(X | Z)

• Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
  Note: this does NOT mean Thunder is independent of Rain

Page 24: Conditional vs. Marginal Independence

Page 25: Conditional vs. Marginal Independence

• Wearing coats is independent of accidents, conditioned on the fact that it rained

Page 26: Handwritten digit recognition (discrete features)

• Training data: n black-and-white (1/0) images with d pixels, X = [X_1, X_2, …, X_d]^T, and n labels Y

• How many parameters?
  - Class probability P(Y = y) = p_y for all y: K − 1 parameters if there are K labels
  - Class-conditional distribution of features (using the Naïve Bayes assumption, which may not hold): P(X_i = x_i | Y = y), one probability value for each y and pixel i: Kd parameters

• Linear instead of exponential in d!
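To make the count concrete, a small sketch (with assumed values for K and d) comparing the Naïve Bayes parameter count to a full class-conditional table over d binary pixels:

```python
K, d = 10, 784                                    # assumed: 10 digit classes, 28x28 binary pixels

nb_params = (K - 1) + K * d                       # class prior + one value per (class, pixel)
full_joint_params = (K - 1) + K * (2 ** d - 1)    # full table for P(X | Y), exponential in d

print("Naive Bayes parameters:", nb_params)                                  # 7849
print("Full joint parameters: about 10^%d" % (len(str(full_joint_params)) - 1))
```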

Page 27: Naïve Bayes Classifier

• Bayes classifier with an additional "naïve" assumption:
  - Features are independent given the class: P(X_1, …, X_d | Y) = ∏_i P(X_i | Y)
• Has fewer parameters, and hence requires less training data, even though the assumption may be violated in practice

Page 28: Naïve Bayes Algo – Discrete features

• Training data
• Maximum Likelihood Estimates
  - For the class probability: P̂(Y = y) = (# examples with Y = y) / n
  - For the class-conditional distribution: P̂(X_i = x | Y = y) = (# examples with X_i = x and Y = y) / (# examples with Y = y)
• NB prediction for test data x = (x_1, …, x_d):

ŷ = argmax_y  P̂(Y = y) ∏_{i=1}^{d} P̂(X_i = x_i | Y = y)
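A minimal sketch of these estimates and the prediction rule for binary features, on made-up toy data; note that unsmoothed MLEs can be zero, which the MAP estimates on the next page fix:

```python
import numpy as np

# Toy training data: n examples, d binary features, labels in {0, 1}.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])
K = 2

# MLE for the class prior: P_hat(Y = c) = (# examples with label c) / n
prior = np.array([(y == c).mean() for c in range(K)])

# MLE for the class-conditionals: P_hat(X_i = 1 | Y = c) for each class c and feature i
cond = np.array([X[y == c].mean(axis=0) for c in range(K)])    # shape (K, d)

def predict(x):
    """argmax_c  P_hat(Y=c) * prod_i P_hat(X_i = x_i | Y = c), computed in log space.
    Unsmoothed MLEs can be exactly 0, giving log(0) = -inf (and a numpy warning);
    the MAP estimates on the next page avoid this."""
    log_post = np.log(prior) + np.array([
        np.log(np.where(x == 1, cond[c], 1 - cond[c])).sum() for c in range(K)
    ])
    return int(np.argmax(log_post))

print(predict(np.array([1, 0, 1])))    # predicts class 1 for this toy example
```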

Page 29: Naïve Bayes Algo – Discrete features

• Training data
• Maximum A Posteriori (MAP) estimates: add m "virtual" data points
  - Assume a given prior distribution over feature values (typically uniform)
  - MAP estimate: P̂(X_i = x | Y = y) = (# examples with X_i = x and Y = y, plus m × prior) / (# examples with Y = y, plus m), where m is the number of virtual examples with Y = y
• Now, even if you never observe a feature value together with some class, the posterior probability is never zero
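A brief sketch of the smoothed estimate (assuming the m virtual examples are spread uniformly over the feature's possible values):

```python
def map_conditional(count_xi_and_y, count_y, m, n_values):
    """P_hat(X_i = x | Y = y) with m virtual examples spread uniformly over the
    n_values possible values of X_i; the result is never exactly zero."""
    return (count_xi_and_y + m / n_values) / (count_y + m)

# A feature value never seen with this class: the MLE would be 0, the MAP estimate is not.
print(map_conditional(count_xi_and_y=0, count_y=3, m=2, n_values=2))   # 0.2
```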

Page 30: Case Study: Text Classification

• Classify e-mails
  - Y = {Spam, NotSpam}
• Classify news articles
  - Y = {what is the topic of the article?}
• Classify webpages
  - Y = {Student, professor, project, …}
• What about the features X?
  - The text!

Page 31: Bag of words approach

Example word counts for a document:

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

Page 32: Bag of words model

• Typical additional assumption: position in the document doesn't matter
  - "Bag of words" model: the order of words on the page is ignored
  - Sounds really silly, but often works very well!

in is lecture lecture next over person remember room sitting the the the to to up wake when you

Page 33: Bag of words model

• Typical additional assumption: position in the document doesn't matter
  - "Bag of words" model: the order of words on the page is ignored
  - Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
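A tiny sketch (not from the slides) of turning this sentence into bag-of-words counts; word order is discarded and only counts remain:

```python
from collections import Counter
import re

doc = ("When the lecture is over, remember to wake up the person "
       "sitting next to you in the lecture room.")

words = re.findall(r"[a-z]+", doc.lower())   # lowercase and strip punctuation
counts = Counter(words)

print(counts["the"], counts["lecture"], counts["aardvark"])   # 3 2 0
```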

Page 34: NB for Text Classification

• Features X are the counts of how many times each word in the vocabulary appears in the document
• The probability table for P(X | Y) is huge!!!
• The NB assumption helps a lot!!!
• Bag of words + Naïve Bayes assumption imply that P(X | Y = y) is just the product of the probability of each word, raised to its count, in a document on topic y

Page 35: NB with Bag of Words for text classification

• Learning phase:
  - Class prior P(Y): fraction of times topic Y appears in the collection of documents
  - P(w | Y): fraction of times word w appears in documents with topic Y
• Test phase:
  - For each document, use the bag of words + naïve Bayes decision rule (see the sketch below)
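A compact sketch of this learning/test recipe on made-up toy documents; word probabilities are smoothed with one virtual count per vocabulary word (an assumption beyond the slide) so unseen word/topic pairs don't zero out the product:

```python
import math
from collections import Counter

# Toy labelled documents.
train = [("buy cheap oil now", "spam"),
         ("cheap cheap offer buy", "spam"),
         ("lecture room schedule", "notspam"),
         ("wake up for the lecture", "notspam")]

vocab = {w for doc, _ in train for w in doc.split()}
labels = {y for _, y in train}

# Learning phase: class prior and per-class word frequencies (with add-one smoothing).
prior = {y: sum(1 for _, lab in train if lab == y) / len(train) for y in labels}
word_counts = {y: Counter() for y in labels}
for doc, y in train:
    word_counts[y].update(doc.split())
total = {y: sum(word_counts[y].values()) for y in labels}
p_word = {y: {w: (word_counts[y][w] + 1) / (total[y] + len(vocab)) for w in vocab}
          for y in labels}

def classify(doc):
    """argmax_y  log P(Y=y) + sum over words w of count(w) * log P(w | Y=y)."""
    scores = {}
    for y in labels:
        score = math.log(prior[y])
        for w, c in Counter(doc.split()).items():
            if w in vocab:                      # ignore out-of-vocabulary words
                score += c * math.log(p_word[y][w])
        scores[y] = score
    return max(scores, key=scores.get)

print(classify("cheap oil offer"))     # expected: spam
print(classify("the lecture room"))    # expected: notspam
```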

Page 36: Twenty newsgroups results

Page 37: Discriminative vs. Generative Classifiers

Generative (model-based) approach, e.g. Naïve Bayes:
• Assume some probability model for P(Y) and P(X | Y)
• Estimate the parameters of the probability models from training data

Discriminative (model-free) approach, e.g. logistic regression:
Why not learn P(Y | X) directly? Or better yet, why not learn the decision boundary directly?
• Assume some functional form for P(Y | X) or for the decision boundary
• Estimate the parameters of the functional form directly from training data

Optimal classifier:

f*(x) = argmax_y P(Y = y | X = x) = argmax_y P(X = x | Y = y) P(Y = y)