
[Figure residue from earlier slides: numbered tic-tac-toe positions forming a game tree, annotated with minimax values v = 1 / v = –1 for optimal play ("optimaalinen peli", Finnish for "optimal game").]
INTRODUCTION TO ARTIFICIAL INTELLIGENCE

DATA15001

EPISODE 5: PROBABILISTIC INFERENCE

TODAY'S MENU

1. BAYES RULE
2. NAIVE BAYES CLASSIFIER
3. SPAM FILTER

HOW TO MAKE MONEY


From: "Margaretta Nita" <margueritesebrina@wmle.com>
Subject: SPECIAL OFFER : VIAGRA ON SALE AT $1.38 !!!
X-Bogosity: Yes, tests=bogofilter, spamicity=0.99993752, version=2011-08-29
Date: Mon, 26 Sep 2011 21:52:26 +0300
X-Classification: junk - ad hoc spam detected (code = 73)

SPECIAL OFFER : VIAGRA ON SALE AT $1.38 !!!

COMPARE THE BEST ONLINE PHARMACIES TO BUY VIAGRA. ORDER VIAGRA ONLINE WITH HUGE DISCOUNT.
MULTIPLE BENEFITS INCLUDE FREE SHIPPING, REORDER DISCOUNTS, BONUS PILLS
HTTP://RXPHARMACYCVS.RU

NAIVE BAYES FOR SPAM FILTERING

[Figure: a graphical model with a CLASS node (spam/ham) pointing to word nodes WORD1, WORD2, WORD3, ...]

• General strategy:
  – estimate word probabilities in each class (spam/ham)
    => we'll return to this a bit later
  – use the estimates to compute the posterior probability P(spam | message),
    where message is the contents (words) of a new message
  – if P(spam | message) > 0.5, classify as spam

THE SEMANTICS OF THIS GRAPH: WORDS ARE "CONDITIONALLY INDEPENDENT OF EACH OTHER GIVEN THE CLASS"

NAIVE BAYES

Random variables:
– Class: spam/ham
– Word1
– Word2
– ...

Distributions:
– P(Class = spam) = 0.5
– P(Wordi = 'viagra' | spam) = 0.002
– P(Wordi = 'viagra' | ham) = 0.0001
– P(Wordi = '$' | spam) = 0.005
– P(Wordi = '$' | ham) = 0.0002
– P(Wordi = 'is' | spam) = 0.002
– P(Wordi = 'is' | ham) = 0.002
– P(Wordi = 'algorithm' | spam) = 0.0001
– P(Wordi = 'algorithm' | ham) = 0.002
– ...


NAIVE BAYES

Inference:

1. P(spam) = 0.5

2. By Bayes' rule,

                                   P(spam) P(Word1 = 'viagra' | spam)
   P(spam | Word1 = 'viagra') = --------------------------------------
                                         P(Word1 = 'viagra')

   where the denominator is obtained by marginalizing over the class:

   P(Word1 = 'viagra') = P(spam) P(Word1 = 'viagra' | spam) + P(ham) P(Word1 = 'viagra' | ham)

BAYES RULES!

3. P(spam | Word1 = 'viagra', Word2 = 'is')

         P(spam) P(Word1 = 'viagra', Word2 = 'is' | spam)
   = ----------------------------------------------------
                P(Word1 = 'viagra', Word2 = 'is')

4. P(spam | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')

         P(spam) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = -------------------------------------------------------------------------
                P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')
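Step 2 can be checked numerically with the example distributions given earlier (P(spam) = 0.5, P(Wordi = 'viagra' | spam) = 0.002, P(Wordi = 'viagra' | ham) = 0.0001); a minimal sketch:

```python
# Worked example of step 2 (Bayes' rule) with the slide's example numbers.
p_spam = 0.5
p_ham = 1 - p_spam
p_viagra_spam = 0.002    # P(Word1 = 'viagra' | spam)
p_viagra_ham = 0.0001    # P(Word1 = 'viagra' | ham)

# Denominator by marginalizing over the class.
p_viagra = p_spam * p_viagra_spam + p_ham * p_viagra_ham

# Bayes' rule: posterior probability of spam given the single word.
posterior = p_spam * p_viagra_spam / p_viagra
print(round(posterior, 4))  # → 0.9524
```

So one occurrence of 'viagra' already pushes the posterior above 95 %.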

NAIVE BAYES

Avoiding the denominator:

        P(spam | message)
   R = -------------------   ⇔   P(spam | message) = R / (1 + R)
        P(ham | message)

        P(spam) P(message | spam) / P(message)     P(spam) P(message | spam)
   R = ---------------------------------------- = ---------------------------
        P(ham) P(message | ham) / P(message)       P(ham) P(message | ham)

4. P(spam | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')
   -------------------------------------------------------------
   P(ham | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')

         P(spam) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = -------------------------------------------------------------------------
         P(ham) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | ham)
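With the same example numbers, one can verify that the odds ratio R gives back exactly the posterior computed with the full denominator; a small check:

```python
# Verify that the odds trick R / (1 + R) agrees with direct Bayes' rule.
# Probabilities are the slide's example values.
p_spam, p_ham = 0.5, 0.5
p_viagra_spam, p_viagra_ham = 0.002, 0.0001

# The odds ratio R: P(message) cancels, so it is never computed.
R = (p_spam * p_viagra_spam) / (p_ham * p_viagra_ham)

# Posterior recovered from R ...
posterior_from_R = R / (1 + R)

# ... equals the posterior from Bayes' rule with the explicit denominator.
posterior_direct = (p_spam * p_viagra_spam) / (
    p_spam * p_viagra_spam + p_ham * p_viagra_ham)

print(round(R, 6), round(posterior_from_R, 4))  # → 20.0 0.9524
```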

NAIVE BAYES

Factorization and the naive Bayes assumption:

P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = P(Word1 = 'viagra' | spam) · P(Word2 = 'is' | spam) · P(Word3 = 'algorithm' | spam)

• This is starting to look good because terms like P(Word3 = 'algorithm' | spam) can be estimated from a training data set...

• ...whereas estimating the probability of combinations of three or more words (n-grams) would require massive data sets

NAIVE BAYES

Putting things together

   P(spam | 'viagra', 'is', 'algorithm')
   --------------------------------------
   P(ham | 'viagra', 'is', 'algorithm')

         P(spam) P('viagra' | spam) P('is' | spam) P('algorithm' | spam)
   = ------------------------------------------------------------------
         P(ham) P('viagra' | ham) P('is' | ham) P('algorithm' | ham)

        P(spam)     P('viagra' | spam)     P('is' | spam)     P('algorithm' | spam)
   = ----------- · -------------------- · ---------------- · -----------------------
        P(ham)      P('viagra' | ham)      P('is' | ham)      P('algorithm' | ham)

• Here we use the fact that (ABC) / (DEF) = (A/D) (B/E) (C/F)

• A nice routine calculation that smoothly becomes a program

NAIVE BAYES

Putting things together

   spamicity(words, estimates):
       R = estimates.prior_spam / estimates.prior_ham
       for each w in words:
           R = R * estimates.spam_prob(w) / estimates.ham_prob(w)
       return R

• estimates.prior_spam = P(spam)
  estimates.prior_ham = P(ham) = 1 – P(spam)

• estimates.spam_prob(w) = P(Wordi = 'w' | spam)
  estimates.ham_prob(w) = P(Wordi = 'w' | ham)

• Recall that P(spam | message) = R / (1+R)
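The pseudocode above runs almost as-is in Python; a self-contained sketch, where the Estimates container is a hypothetical helper and the probabilities are the slide's example values:

```python
# Runnable version of the spamicity routine. The Estimates class is an
# illustrative container, not part of any real library.
class Estimates:
    def __init__(self, prior_spam, spam_probs, ham_probs):
        self.prior_spam = prior_spam
        self.prior_ham = 1 - prior_spam       # P(ham) = 1 - P(spam)
        self._spam, self._ham = spam_probs, ham_probs

    def spam_prob(self, w):                   # P(Word_i = w | spam)
        return self._spam[w]

    def ham_prob(self, w):                    # P(Word_i = w | ham)
        return self._ham[w]

def spamicity(words, estimates):
    R = estimates.prior_spam / estimates.prior_ham
    for w in words:
        R = R * estimates.spam_prob(w) / estimates.ham_prob(w)
    return R

# Example probabilities from the slides.
est = Estimates(
    prior_spam=0.5,
    spam_probs={'viagra': 0.002, 'is': 0.002, 'algorithm': 0.0001},
    ham_probs={'viagra': 0.0001, 'is': 0.002, 'algorithm': 0.002},
)

R = spamicity(['viagra', 'is', 'algorithm'], est)
print(round(R, 6), round(R / (1 + R), 6))  # → 1.0 0.5
```

Here 'viagra' multiplies R by 20, 'is' leaves it unchanged, and 'algorithm' multiplies it by 0.05, so the evidence cancels out and the posterior stays at 0.5.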

NAIVE BAYES: INTERIM SUMMARY

• What we need to estimate from training data:
  – prior probabilities P(spam) and P(ham)
  – conditional probabilities for words P(Wordi = 'w' | spam) and P(Wordi = 'w' | ham)

• The conditional independence assumption, aka the "naive Bayes assumption", implies that
  P(Wordi = 'w', Wordj = 'v' | spam) = P(Wordi = 'w' | spam) P(Wordj = 'v' | spam)

• The essential quantity is the ratio P(Wordi = 'w' | spam) / P(Wordi = 'w' | ham)

NAIVE BAYES: INTERIM SUMMARY

• The cost function is asymmetric:
  – it is worse to misclassify ham as spam than vice versa
  – the decision to classify as spam should be made only if P(spam | message) > a, for some a > 0.5

• Under- and overflows:
  – a product with many terms tends to become either very large or very small (close to zero)
  – can be solved by computing log(R) instead of R
  – log(AB) = log(A) + log(B), log(A/B) = log(A) – log(B)
  – at the end, recover R by exp(log(R)) = R
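The log trick above can be sketched as follows (same example probabilities as before; only the arithmetic changes from products to sums):

```python
import math

# Log-domain version of the odds computation: sums of logs replace
# products, which avoids under/overflow for messages with many words.
prior_spam, prior_ham = 0.5, 0.5
spam_probs = {'viagra': 0.002, 'is': 0.002, 'algorithm': 0.0001}
ham_probs = {'viagra': 0.0001, 'is': 0.002, 'algorithm': 0.002}

def log_spamicity(words):
    log_R = math.log(prior_spam) - math.log(prior_ham)
    for w in words:
        log_R += math.log(spam_probs[w]) - math.log(ham_probs[w])
    return log_R

log_R = log_spamicity(['viagra', 'is', 'algorithm'])
R = math.exp(log_R)        # recover R only at the very end
print(round(R, 6))         # → 1.0
```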

Additional remarks

NAIVE BAYES: ESTIMATION

Estimating the word probabilities from training data:

   spam                              ham
   count  word      freq             count  word       freq
       1  MONEY     0.04 %              21  ALGORITHM  0.01 %
     ...                              ...
       5  VIAGRA    0.21 %              62  MONEY      0.02 %
     ...                              ...
      10  IS        0.42 %            2199  FOR        0.78 %
     ...                              2492  THAT       0.88 %
      19  REPLICA   0.80 %            2990  YOU        1.05 %
      20  EMAIL     0.84 %            3141  IN         1.11 %
      20  YOU       0.84 %            3160  I          1.11 %
      21  DATABASE  0.88 %            3218  AND        1.13 %
      25  EMAILS    1.05 %            3283  IS         1.16 %
      26  OF        1.09 %            3472  OF         1.22 %
      31  TO        1.30 %            3874  A          1.37 %
      43  AND       1.80 %            5442  TO         1.92 %
      48  THE       2.01 %            9196  THE        3.24 %
   TOTAL  2386                       TOTAL  283736


   P(Wordi = 'money' | spam)     1/2386
   -------------------------- = ----------- ≈ 1.918 > 1
   P(Wordi = 'money' | ham)      62/283736

   P(Wordi = 'is' | spam)     10/2386
   ------------------------ = ------------- ≈ 0.3622 < 1
   P(Wordi = 'is' | ham)       3283/283736
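These ratios follow directly from the counts in the table; a small sketch:

```python
# Word-probability ratios estimated from the corpus counts on the slide.
spam_counts = {'money': 1, 'viagra': 5, 'is': 10}
spam_total = 2386
ham_counts = {'algorithm': 21, 'money': 62, 'is': 3283}
ham_total = 283736

def ratio(w):
    # P(Word_i = w | spam) / P(Word_i = w | ham), both estimated as
    # relative frequencies in the respective corpus.
    return (spam_counts[w] / spam_total) / (ham_counts[w] / ham_total)

print(round(ratio('money'), 3))  # → 1.918  (evidence for spam)
print(round(ratio('is'), 4))     # → 0.3622 (evidence for ham)
```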


What if P('algorithm' | spam) = 0.0? This may lead to a situation where the ratio is 0/0! Solution: replace 0 by, e.g., 0.000001.
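The zero-replacement fix can be dropped straight into the estimation step; a minimal sketch, using the slide's suggested floor value (the helper name estimate_prob is illustrative):

```python
EPS = 0.000001  # floor value suggested on the slide

def estimate_prob(count, total):
    # Relative-frequency estimate, but a probability of exactly 0 is
    # replaced by a small constant so that ratios like
    # P(w | spam) / P(w | ham) never become 0/0.
    p = count / total
    return p if p > 0 else EPS

# 'algorithm' never occurs in the spam corpus:
print(estimate_prob(0, 2386))         # → 1e-06
print(estimate_prob(21, 283736) > 0)  # → True
```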
