
Inf2b Learning and Data
Lecture 6: Naive Bayes

Hiroshi Shimodaira
(Credit: Iain Murray and Steve Renals)

Centre for Speech Technology Research (CSTR)
School of Informatics
University of Edinburgh

Jan-Mar 2014

Inf2b Learning and Data: Lecture 6 Naive Bayes 1


Today’s Schedule

1 Bayes decision rule review

2 The curse of dimensionality

3 Naive Bayes

4 Text classification using Naive Bayes (introduction)

Inf2b Learning and Data: Lecture 6 Naive Bayes 2


Bayes decision rule (recap)

Classes $C = \{c_1, \ldots, c_K\}$; input features $X = x$

Most probable class: (maximum posterior class)

$$
\begin{aligned}
c^* &= \arg\max_{c_k} P(c_k \mid x) \\
    &= \arg\max_{c_k} \frac{P(x \mid c_k)\, P(c_k)}{P(x)}
     = \arg\max_{c_k} \frac{P(x \mid c_k)\, P(c_k)}{\sum_{j=1}^{K} P(x \mid c_j)\, P(c_j)} \\
    &= \arg\max_{c_k} P(x \mid c_k)\, P(c_k)
\end{aligned}
$$

where $P(c_k \mid x)$ : posterior
      $P(x \mid c_k)$ : likelihood
      $P(c_k)$ : prior

⇒ Minimum error (misclassification) rate classification (PRML, C. M. Bishop (2006), Section 1.5)
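
A minimal sketch of this rule in Python (my illustration, not from the slides; the likelihood and prior numbers are hypothetical):

```python
import numpy as np

# Hypothetical: two classes, with P(x | c_k) already evaluated at one input x.
likelihoods = np.array([0.12, 0.05])  # P(x | c_1), P(x | c_2)
priors = np.array([0.5, 0.5])         # P(c_1), P(c_2)

# Unnormalised posteriors P(x | c_k) P(c_k); P(x) is constant across classes,
# so dropping it cannot change the arg max.
scores = likelihoods * priors
posteriors = scores / scores.sum()

c_star = np.argmax(scores)  # minimum-error-rate decision
print(c_star, posteriors)
```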

Inf2b Learning and Data: Lecture 6 Naive Bayes 3


Fish classification (revisited)

Bayesian class estimation:

$$P(c_k \mid x) = \frac{P(x \mid c_k)\, P(c_k)}{P(x)} \propto P(x \mid c_k)\, P(c_k)$$

Estimating the terms: (Non-Bayesian)

Priors: $P(C = M) \approx \dfrac{N_M}{N_M + N_F}$, . . .

[Figure: histogram estimate of the likelihood over length, 0-25 cm]

Likelihoods: $P(x \mid C = M) \approx \dfrac{n_M(x)}{N_M}$, . . .

NB: These approximations work well only if we have enough data
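
A count-based sketch of both estimates (hypothetical lengths; the 5 cm histogram bins are an assumption, not from the slides):

```python
import numpy as np

# Hypothetical fish lengths in cm.
male_lengths = np.array([4.1, 6.0, 7.2, 9.5, 11.3, 12.8])
female_lengths = np.array([8.4, 10.1, 12.9, 14.2, 15.5, 17.0, 18.3, 19.1])
N_M, N_F = len(male_lengths), len(female_lengths)

# Prior from relative class counts: P(C=M) ~ N_M / (N_M + N_F).
prior_M = N_M / (N_M + N_F)

# Histogram likelihood: P(x | C=M) ~ n_M(x) / N_M, with 5 cm bins over 0-25 cm.
bins = np.arange(0, 30, 5)
n_M, _ = np.histogram(male_lengths, bins=bins)
lik_M = n_M / N_M  # one probability per bin; sparse when data are scarce

print(prior_M, lik_M)
```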

Inf2b Learning and Data: Lecture 6 Naive Bayes 4


Fish classification (revisited)

$$P(c_k \mid x) = \frac{P(x \mid c_k)\, P(c_k)}{P(x)}$$

$P(M) : P(F) = 1 : 1$

[Figure: likelihoods $P(x \mid M)$ and $P(x \mid F)$ vs. length/cm]

[Figure: posteriors $P(M \mid x)$ and $P(F \mid x)$ vs. length/cm]

Inf2b Learning and Data: Lecture 6 Naive Bayes 5


Fish classification (revisited)

$$P(c_k \mid x) = \frac{P(x \mid c_k)\, P(c_k)}{P(x)}$$

$P(M) : P(F) = 1 : 4$

[Figure: likelihoods $P(x \mid M)$ and $P(x \mid F)$ vs. length/cm]

[Figure: posteriors $P(M \mid x)$ and $P(F \mid x)$ vs. length/cm]
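
The effect of the prior is easy to reproduce in a sketch (hypothetical likelihood values at a single length $x$):

```python
import numpy as np

lik_M, lik_F = 0.18, 0.10  # hypothetical P(x|M), P(x|F) at one length

for prior_M, prior_F in [(0.5, 0.5), (0.2, 0.8)]:  # priors 1:1 and 1:4
    scores = np.array([lik_M * prior_M, lik_F * prior_F])
    print((prior_M, prior_F), scores / scores.sum())  # P(M|x), P(F|x)
```

With equal priors the larger likelihood wins; at 1:4 the posterior flips toward F even though the likelihoods are unchanged.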

Inf2b Learning and Data: Lecture 6 Naive Bayes 6


1 Bayes decision rule review

2 The curse of dimensionality

3 Naive Bayes

4 Text classification using Naive Bayes (introduction)

Inf2b Learning and Data: Lecture 6 Naive Bayes 7


How can we improve the fish classification?

[Figure: histogram of lengths of male fish, frequency vs. length/cm]

[Figure: histogram of lengths of female fish, frequency vs. length/cm]

Inf2b Learning and Data: Lecture 6 Naive Bayes 8


More features!?

$$P(x \mid c_k) \approx \frac{n_{c_k}(x_1, \ldots, x_d)}{N_{c_k}}$$

1D histogram → 2D histogram → 3D cube of numbers → . . .

[Figure: 2D histogram of counts $n_C(\mathbf{x})$ over $(x_1, x_2)$]

100 binary variables, $2^{100}$ settings (the universe is $\approx 2^{98}$ picoseconds old)

In high dimensions almost all $n_C(x_1, \ldots, x_d)$ are zero

⇒ Bellman’s “curse of dimensionality”
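
A two-line check of the table-size arithmetic:

```python
# A full joint count table over d binary features needs 2**d cells.
d = 100
print(2 ** d)  # 1267650600228229401496703205376 cells
# No dataset can fill that many cells, so almost every count stays zero.
```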

Inf2b Learning and Data: Lecture 6 Naive Bayes 9


Avoiding the Curse of Dimensionality

Apply the chain rule?

$$
\begin{aligned}
P(x \mid c_k) &= P(x_1, x_2, \ldots, x_d \mid c_k) \\
&= P(x_1 \mid c_k)\, P(x_2 \mid x_1, c_k)\, P(x_3 \mid x_2, x_1, c_k) \cdots P(x_d \mid x_{d-1}, \ldots, x_1, c_k)
\end{aligned}
$$

Solution: assume structure in $P(x \mid c_k)$

For example,

Assume $x_{i+1}$ depends on $x_i$ only (a first-order Markov assumption):

$$P(x \mid c_k) \approx P(x_1 \mid c_k)\, P(x_2 \mid x_1, c_k)\, P(x_3 \mid x_2, c_k) \cdots P(x_d \mid x_{d-1}, c_k)$$

Assume $x \in \mathbb{R}^d$ lies in a low-dimensional vector subspace

Dimensionality reduction by PCA (Principal Component Analysis) / KL transform
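
A sketch of the first-order assumption in code (numbers are hypothetical; the transition table is shared across positions only for brevity):

```python
import numpy as np

# d = 4 binary features for one class c_k.
# p1[v] = P(x_1 = v | c_k); trans[i, u, v] = P(x_{i+1} = v | x_i = u, c_k).
p1 = np.array([0.6, 0.4])
trans = np.array([[[0.7, 0.3], [0.2, 0.8]]] * 3)

def markov_likelihood(x):
    """P(x | c_k) ~ P(x_1 | c_k) * prod_i P(x_{i+1} | x_i, c_k)."""
    p = p1[x[0]]
    for i in range(len(x) - 1):
        p *= trans[i, x[i], x[i + 1]]
    return p

print(markov_likelihood([1, 0, 0, 1]))
```

This needs $2 + 3 \cdot 4$ numbers per class rather than the $2^4 = 16$ of the full joint table.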

Inf2b Learning and Data: Lecture 6 Naive Bayes 10


Avoiding the Curse of Dimensionality

Apply smoothing windows (e.g. Parzen windows)

Apply a probability distribution model (e.g. Normal dist.)

Assume $x_1, \ldots, x_d$ are independent of each other

⇒ Naive Bayes rule/model (or idiot Bayes rule)

$$P(x_1, x_2, \ldots, x_d \mid c_k) \approx P(x_1 \mid c_k)\, P(x_2 \mid c_k) \cdots P(x_d \mid c_k) = \prod_{i=1}^{d} P(x_i \mid c_k)$$

Is it reasonable? Often not, of course!

Although it can still be useful.
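
A minimal sketch of the factorisation (hypothetical per-feature tables; not a full classifier):

```python
import numpy as np

# lik[k, i, v] = P(x_i = v | c_k) for two classes and d = 3 binary features.
lik = np.array([
    [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]],  # class c_1
    [[0.2, 0.8], [0.5, 0.5], [0.3, 0.7]],  # class c_2
])
priors = np.array([0.6, 0.4])

x = [1, 0, 1]  # one input vector

# Naive Bayes: P(x | c_k) ~ prod_i P(x_i | c_k), then score with the prior.
scores = priors * np.array([
    np.prod([lik[k, i, v] for i, v in enumerate(x)]) for k in range(2)
])
print(np.argmax(scores), scores / scores.sum())
```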

Inf2b Learning and Data: Lecture 6 Naive Bayes 11


Example - game played depending on the weather

Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  NO
sunny      hot          high      true   NO
overcast   hot          high      false  YES
rainy      mild         high      false  YES
rainy      cool         normal    false  YES
rainy      cool         normal    true   NO
overcast   cool         normal    true   YES
sunny      mild         high      false  NO
sunny      cool         normal    false  YES
rainy      mild         normal    false  YES
sunny      mild         normal    true   YES
overcast   mild         high      true   YES
overcast   hot          normal    false  YES
rainy      mild         high      true   NO

$$P(\mathrm{Play} \mid O, T, H, W) = \frac{P(O, T, H, W \mid \mathrm{Play})\, P(\mathrm{Play})}{P(O, T, H, W)}$$

Number of combinations of $(O, T, H, W)$: $3 \times 3 \times 2 \times 2 = 36$

Inf2b Learning and Data: Lecture 6 Naive Bayes 12
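
The combination count is easy to verify (feature values as listed in the table above):

```python
import itertools

outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = ["false", "true"]

# Every possible weather condition: far more than the 14 observed records,
# so estimating the full joint P(O, T, H, W | Play) from counts is hopeless.
print(len(list(itertools.product(outlook, temperature, humidity, windy))))  # 36
```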


Applying Naive Bayes

$$P(\mathrm{Play} \mid O, T, H, W) = \frac{P(O, T, H, W \mid \mathrm{Play})\, P(\mathrm{Play})}{P(O, T, H, W)} \propto P(O, T, H, W \mid \mathrm{Play})\, P(\mathrm{Play})$$

Applying the Naive Bayes rule,

$$P(O, T, H, W \mid \mathrm{Play}) \approx P(O \mid \mathrm{Play})\, P(T \mid \mathrm{Play})\, P(H \mid \mathrm{Play})\, P(W \mid \mathrm{Play})$$

Inf2b Learning and Data: Lecture 6 Naive Bayes 13


Relative frequencies

Consider each feature independently to estimate $P(O \mid \mathrm{Play})$, $P(T \mid \mathrm{Play})$, $P(H \mid \mathrm{Play})$, $P(W \mid \mathrm{Play})$

Outlook    Y    N
sunny      2/9  3/5
overcast   4/9  0/5
rainy      3/9  2/5

Temperature  Y    N
hot          2/9  2/5
mild         4/9  2/5
cool         3/9  1/5

Humidity  Y    N
high      3/9  4/5
normal    6/9  1/5

Windy  Y    N
false  6/9  2/5
true   3/9  3/5

There was play 9 out of 14 times: $P(\mathrm{Play} = Y) \approx \frac{9}{14}$
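
A sketch of how these numbers fall out of counting (the records are transcribed from the table two slides back):

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, Play): the 14 records from the slide.
data = [
    ("sunny", "hot", "high", "false", "NO"),
    ("sunny", "hot", "high", "true", "NO"),
    ("overcast", "hot", "high", "false", "YES"),
    ("rainy", "mild", "high", "false", "YES"),
    ("rainy", "cool", "normal", "false", "YES"),
    ("rainy", "cool", "normal", "true", "NO"),
    ("overcast", "cool", "normal", "true", "YES"),
    ("sunny", "mild", "high", "false", "NO"),
    ("sunny", "cool", "normal", "false", "YES"),
    ("rainy", "mild", "normal", "false", "YES"),
    ("sunny", "mild", "normal", "true", "YES"),
    ("overcast", "mild", "high", "true", "YES"),
    ("overcast", "hot", "normal", "false", "YES"),
    ("rainy", "mild", "high", "true", "NO"),
]

class_counts = Counter(row[-1] for row in data)  # {'YES': 9, 'NO': 5}
prior_yes = class_counts["YES"] / len(data)      # 9/14

def cond_prob(feature_index, value, play):
    """Relative frequency P(feature = value | Play = play)."""
    rows = [r for r in data if r[-1] == play]
    return sum(r[feature_index] == value for r in rows) / len(rows)

print(cond_prob(0, "sunny", "YES"))  # 2/9
print(cond_prob(0, "sunny", "NO"))   # 3/5
```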

Inf2b Learning and Data: Lecture 6 Naive Bayes 14


Applying Naive Bayes

Posterior play probability: x = (sunny, cool, humid, windy)

$$P(\mathrm{Play} \mid x) \propto P(x \mid \mathrm{Play})\, P(\mathrm{Play})$$

Estimating the Naive Bayes likelihood: (Non-Bayesian)

$$
\begin{aligned}
P(x \mid \mathrm{Play}{=}Y) &= P(O{=}s \mid Y)\, P(T{=}c \mid Y)\, P(H{=}h \mid Y)\, P(W{=}t \mid Y) \\
&\approx \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \\
P(x \mid \mathrm{Play}{=}N) &= P(O{=}s \mid N)\, P(T{=}c \mid N)\, P(H{=}h \mid N)\, P(W{=}t \mid N) \\
&\approx \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5}
\end{aligned}
$$

Exercise: find the odds of play, $P(\mathrm{Play}{=}Y \mid x) / P(\mathrm{Play}{=}N \mid x)$ (answer in notes)
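
A sketch for checking the exercise numerically; it computes the odds without spoiling the value given in the notes:

```python
from fractions import Fraction as F

# Likelihood terms read off the relative-frequency tables above.
lik_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)
lik_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)

# Priors: P(Play=Y) = 9/14, P(Play=N) = 5/14.
post_yes = lik_yes * F(9, 14)  # unnormalised posterior for Y
post_no = lik_no * F(5, 14)    # unnormalised posterior for N

# P(x) cancels in the ratio, so the odds need no normalisation.
print(post_yes / post_no)
```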

Inf2b Learning and Data: Lecture 6 Naive Bayes 15


Naive Bayes properties

Easy and cheap: record counts, convert to frequencies, score each class by multiplying prior and likelihood terms

$$P(c_k \mid x) \propto \left( \prod_{i=1}^{d} P(x_i \mid c_k) \right) P(c_k)$$

Statistically viable: simple count-based estimates work in 1D

Often overconfident: treats dependent evidence as independent
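
One practical point worth a sketch (my addition, not on the slide): multiplying many small per-feature probabilities underflows floating point, so implementations normally sum logs instead:

```python
import math

def nb_score(log_prior, log_likelihoods):
    """Log of P(c_k) * prod_i P(x_i | c_k): a sum, so no underflow."""
    return log_prior + sum(log_likelihoods)

# Hypothetical: 1000 features each contributing probability ~0.1.
log_lik = [math.log(0.1)] * 1000
print(nb_score(math.log(0.5), log_lik))  # about -2303.3, fine in log space

# The direct product underflows to exactly 0.0 in float64.
print(0.5 * 0.1 ** 1000)  # 0.0
```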

Inf2b Learning and Data: Lecture 6 Naive Bayes 16


1 Bayes decision rule review

2 The curse of dimensionality

3 Naive Bayes

4 Text classification using Naive Bayes (introduction)

Inf2b Learning and Data: Lecture 6 Naive Bayes 17


Identifying Spam

Spam?

I got your contact information from your country's information directory during my desperate search for someone who can assist me secretly and confidentially in relocating and managing some family fortunes.

Inf2b Learning and Data: Lecture 6 Naive Bayes 18


Identifying Spam

Spam?

Dear Dr. Steve Renals, The proof for your article, Combining Spectral Representations for Large-Vocabulary Continuous Speech Recognition, is ready for your review. Please access your proof via the user ID and password provided below. Kindly log in to the website within 48 HOURS of receiving this message so that we may expedite the publication process.

Inf2b Learning and Data: Lecture 6 Naive Bayes 19


Identifying Spam

Spam?

Congratulations to you as we bring to your notice, the results of the First Category draws of THE HOLLAND CASINO LOTTO PROMO INT. We are happy to inform you that you have emerged a winner under the First Category, which is part of our promotional draws.

Inf2b Learning and Data: Lecture 6 Naive Bayes 20


Identifying Spam

Question

How can we identify an email as spam automatically?

Text classification: classify email messages as spam or non-spam (ham), based on the words they contain

Inf2b Learning and Data: Lecture 6 Naive Bayes 21


Text Classification using Bayes Theorem

Document $D$, with class $c_k$

Classify $D$ as the class with the highest posterior probability:

$$P(c_k \mid D) = \frac{P(D \mid c_k)\, P(c_k)}{P(D)} \propto P(D \mid c_k)\, P(c_k)$$

How do we represent $D$? How do we estimate $P(D \mid c_k)$?

Bernoulli document model: a document is represented by a binary feature vector, whose elements indicate the absence or presence of the corresponding word in the document

Multinomial document model: a document is represented by an integer feature vector, whose elements indicate the frequency of the corresponding word in the document
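
A sketch of the two representations over a tiny hypothetical vocabulary (the models themselves are covered next lecture):

```python
vocabulary = ["casino", "winner", "proof", "article", "password"]

def bernoulli_vector(document):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [int(w in words) for w in vocabulary]

def multinomial_vector(document):
    """Integer vector: how many times each vocabulary word occurs."""
    words = document.lower().split()
    return [words.count(w) for w in vocabulary]

doc = "winner winner casino prize"
print(bernoulli_vector(doc))    # [1, 1, 0, 0, 0]
print(multinomial_vector(doc))  # [1, 2, 0, 0, 0]
```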

Inf2b Learning and Data: Lecture 6 Naive Bayes 22


Summary

The curse of dimensionality

Naive Bayes approximation

Example: classifying multidimensional data using Naive Bayes

Next lecture: Text classification using Naive Bayes

Inf2b Learning and Data: Lecture 6 Naive Bayes 23