
[Figure residue from earlier slides: numbered tic-tac-toe positions forming a game tree, annotated with minimax values v = 1 / v = –1 for optimal play ("optimaalinen peli", Finnish for "optimal game").]
INTRODUCTION TO ARTIFICIAL INTELLIGENCE

DATA15001

EPISODE 5: PROBABILISTIC INFERENCE

TODAY'S MENU

1. BAYES RULE
2. NAIVE BAYES CLASSIFIER
3. SPAM FILTER

HOW TO MAKE MONEY


From: "Margaretta Nita" <margueritesebrina@wmle.com>
Subject: SPECIAL OFFER : VIAGRA ON SALE AT $1.38 !!!
X-Bogosity: Yes, tests=bogofilter, spamicity=0.99993752, version=2011-08-29
Date: Mon, 26 Sep 2011 21:52:26 +0300
X-Classification: junk - ad hoc spam detected (code = 73)

SPECIAL OFFER : VIAGRA ON SALE AT $1.38 !!!

COMPARE THE BEST ONLINE PHARMACIES TO BUY VIAGRA. ORDER VIAGRA ONLINE WITH HUGE DISCOUNT.
MULTIPLE BENEFITS INCLUDE FREE SHIPPING, REORDER DISCOUNTS, BONUS PILLS
HTTP://RXPHARMACYCVS.RU

NAIVE BAYES FOR SPAM FILTERING

[Figure: a graphical model with a CLASS node (spam/ham) pointing to word nodes WORD1, WORD2, WORD3, ...]

• General strategy:
  – estimate word probabilities in each class (spam/ham)
    => we'll return to this a bit later
  – use the estimates to compute the posterior probability P(spam | message),
    where message is the contents (words) of a new message
  – if P(spam | message) > 0.5, classify as spam

THE SEMANTICS OF THIS GRAPH: WORDS ARE "CONDITIONALLY INDEPENDENT OF EACH OTHER GIVEN THE CLASS"

NAIVE BAYES

Random variables:
– Class: spam/ham
– Word1
– Word2
– ...

Distributions:
– P(Class = spam) = 0.5
– P(Wordi = 'viagra' | spam) = 0.002
– P(Wordi = 'viagra' | ham) = 0.0001
– P(Wordi = '$' | spam) = 0.005
– P(Wordi = '$' | ham) = 0.0002
– P(Wordi = 'is' | spam) = 0.002
– P(Wordi = 'is' | ham) = 0.002
– P(Wordi = 'algorithm' | spam) = 0.0001
– P(Wordi = 'algorithm' | ham) = 0.002
– ...


NAIVE BAYES

Inference:

1. P(spam) = 0.5

2. By Bayes' rule,

                                   P(spam) P(Word1 = 'viagra' | spam)
   P(spam | Word1 = 'viagra') = --------------------------------------
                                         P(Word1 = 'viagra')

   where the denominator is obtained by marginalizing over the class:

   P(Word1 = 'viagra') = P(spam) P(Word1 = 'viagra' | spam) + P(ham) P(Word1 = 'viagra' | ham)

BAYES RULES!

3. P(spam | Word1 = 'viagra', Word2 = 'is')

         P(spam) P(Word1 = 'viagra', Word2 = 'is' | spam)
   = ----------------------------------------------------
                P(Word1 = 'viagra', Word2 = 'is')

4. P(spam | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')

         P(spam) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = -------------------------------------------------------------------------
                P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')
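Step 2 can be checked numerically with the example distributions given earlier (P(spam) = 0.5, P(Wordi = 'viagra' | spam) = 0.002, P(Wordi = 'viagra' | ham) = 0.0001); a minimal sketch:

```python
# Worked example of step 2 (Bayes' rule) with the slide's example numbers.
p_spam = 0.5
p_ham = 1 - p_spam
p_viagra_spam = 0.002    # P(Word1 = 'viagra' | spam)
p_viagra_ham = 0.0001    # P(Word1 = 'viagra' | ham)

# Denominator by marginalizing over the class.
p_viagra = p_spam * p_viagra_spam + p_ham * p_viagra_ham

# Bayes' rule: posterior probability of spam given the single word.
posterior = p_spam * p_viagra_spam / p_viagra
print(round(posterior, 4))  # → 0.9524
```

So one occurrence of 'viagra' already pushes the posterior above 95 %.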

NAIVE BAYES

Avoiding the denominator:

        P(spam | message)
   R = -------------------   ⇔   P(spam | message) = R / (1 + R)
        P(ham | message)

        P(spam) P(message | spam) / P(message)     P(spam) P(message | spam)
   R = ---------------------------------------- = ---------------------------
        P(ham) P(message | ham) / P(message)       P(ham) P(message | ham)

4. P(spam | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')
   -------------------------------------------------------------
   P(ham | Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm')

         P(spam) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = -------------------------------------------------------------------------
         P(ham) P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | ham)
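With the same example numbers, one can verify that the odds ratio R gives back exactly the posterior computed with the full denominator; a small check:

```python
# Verify that the odds trick R / (1 + R) agrees with direct Bayes' rule.
# Probabilities are the slide's example values.
p_spam, p_ham = 0.5, 0.5
p_viagra_spam, p_viagra_ham = 0.002, 0.0001

# The odds ratio R: P(message) cancels, so it is never computed.
R = (p_spam * p_viagra_spam) / (p_ham * p_viagra_ham)

# Posterior recovered from R ...
posterior_from_R = R / (1 + R)

# ... equals the posterior from Bayes' rule with the explicit denominator.
posterior_direct = (p_spam * p_viagra_spam) / (
    p_spam * p_viagra_spam + p_ham * p_viagra_ham)

print(round(R, 6), round(posterior_from_R, 4))  # → 20.0 0.9524
```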

NAIVE BAYES

Factorization and the naive Bayes assumption:

P(Word1 = 'viagra', Word2 = 'is', Word3 = 'algorithm' | spam)
   = P(Word1 = 'viagra' | spam) · P(Word2 = 'is' | spam) · P(Word3 = 'algorithm' | spam)

• This is starting to look good because terms like P(Word3 = 'algorithm' | spam) can be estimated from a training data set...

• ...whereas estimating the probability of combinations of three or more words (n-grams) would require massive data sets

NAIVE BAYES

Putting things together

   P(spam | 'viagra', 'is', 'algorithm')
   --------------------------------------
   P(ham | 'viagra', 'is', 'algorithm')

         P(spam) P('viagra' | spam) P('is' | spam) P('algorithm' | spam)
   = ------------------------------------------------------------------
         P(ham) P('viagra' | ham) P('is' | ham) P('algorithm' | ham)

        P(spam)     P('viagra' | spam)     P('is' | spam)     P('algorithm' | spam)
   = ----------- · -------------------- · ---------------- · -----------------------
        P(ham)      P('viagra' | ham)      P('is' | ham)      P('algorithm' | ham)

• Here we use the fact that (ABC) / (DEF) = (A/D) (B/E) (C/F)

• A nice routine calculation that smoothly becomes a program

NAIVE BAYES

Putting things together

   spamicity(words, estimates):
       R = estimates.prior_spam / estimates.prior_ham
       for each w in words:
           R = R * estimates.spam_prob(w) / estimates.ham_prob(w)
       return R

• estimates.prior_spam = P(spam)
  estimates.prior_ham = P(ham) = 1 – P(spam)

• estimates.spam_prob(w) = P(Wordi = 'w' | spam)
  estimates.ham_prob(w) = P(Wordi = 'w' | ham)

• Recall that P(spam | message) = R / (1+R)
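The pseudocode above runs almost as-is in Python; a self-contained sketch, where the Estimates container is a hypothetical helper and the probabilities are the slide's example values:

```python
# Runnable version of the spamicity routine. The Estimates class is an
# illustrative container, not part of any real library.
class Estimates:
    def __init__(self, prior_spam, spam_probs, ham_probs):
        self.prior_spam = prior_spam
        self.prior_ham = 1 - prior_spam       # P(ham) = 1 - P(spam)
        self._spam, self._ham = spam_probs, ham_probs

    def spam_prob(self, w):                   # P(Word_i = w | spam)
        return self._spam[w]

    def ham_prob(self, w):                    # P(Word_i = w | ham)
        return self._ham[w]

def spamicity(words, estimates):
    R = estimates.prior_spam / estimates.prior_ham
    for w in words:
        R = R * estimates.spam_prob(w) / estimates.ham_prob(w)
    return R

# Example probabilities from the slides.
est = Estimates(
    prior_spam=0.5,
    spam_probs={'viagra': 0.002, 'is': 0.002, 'algorithm': 0.0001},
    ham_probs={'viagra': 0.0001, 'is': 0.002, 'algorithm': 0.002},
)

R = spamicity(['viagra', 'is', 'algorithm'], est)
print(round(R, 6), round(R / (1 + R), 6))  # → 1.0 0.5
```

Here 'viagra' multiplies R by 20, 'is' leaves it unchanged, and 'algorithm' multiplies it by 0.05, so the evidence cancels out and the posterior stays at 0.5.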

NAIVE BAYES: INTERIM SUMMARY

• What we need to estimate from training data:
  – prior probabilities P(spam) and P(ham)
  – conditional probabilities for words P(Wordi = 'w' | spam) and P(Wordi = 'w' | ham)

• The conditional independence assumption, aka the "naive Bayes assumption", implies that
  P(Wordi = 'w', Wordj = 'v' | spam) = P(Wordi = 'w' | spam) P(Wordj = 'v' | spam)

• The essential quantity is the ratio P(Wordi = 'w' | spam) / P(Wordi = 'w' | ham)

NAIVE BAYES: INTERIM SUMMARY

• The cost function is asymmetric:
  – it is worse to misclassify ham as spam than vice versa
  – the decision to classify as spam should be made only if P(spam | message) > a, for some a > 0.5

• Under- and overflows:
  – a product with many terms tends to become either very large or very small (close to zero)
  – can be solved by computing log(R) instead of R
  – log(AB) = log(A) + log(B), log(A/B) = log(A) – log(B)
  – at the end, recover R by exp(log(R)) = R
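The log trick above can be sketched as follows (same example probabilities as before; only the arithmetic changes from products to sums):

```python
import math

# Log-domain version of the odds computation: sums of logs replace
# products, which avoids under/overflow for messages with many words.
prior_spam, prior_ham = 0.5, 0.5
spam_probs = {'viagra': 0.002, 'is': 0.002, 'algorithm': 0.0001}
ham_probs = {'viagra': 0.0001, 'is': 0.002, 'algorithm': 0.002}

def log_spamicity(words):
    log_R = math.log(prior_spam) - math.log(prior_ham)
    for w in words:
        log_R += math.log(spam_probs[w]) - math.log(ham_probs[w])
    return log_R

log_R = log_spamicity(['viagra', 'is', 'algorithm'])
R = math.exp(log_R)        # recover R only at the very end
print(round(R, 6))         # → 1.0
```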

Additional remarks

NAIVE BAYES: ESTIMATION

Estimating the word probabilities from training data:

   spam                              ham
   count  word      freq             count  word       freq
       1  MONEY     0.04 %              21  ALGORITHM  0.01 %
     ...                              ...
       5  VIAGRA    0.21 %              62  MONEY      0.02 %
     ...                              ...
      10  IS        0.42 %            2199  FOR        0.78 %
     ...                              2492  THAT       0.88 %
      19  REPLICA   0.80 %            2990  YOU        1.05 %
      20  EMAIL     0.84 %            3141  IN         1.11 %
      20  YOU       0.84 %            3160  I          1.11 %
      21  DATABASE  0.88 %            3218  AND        1.13 %
      25  EMAILS    1.05 %            3283  IS         1.16 %
      26  OF        1.09 %            3472  OF         1.22 %
      31  TO        1.30 %            3874  A          1.37 %
      43  AND       1.80 %            5442  TO         1.92 %
      48  THE       2.01 %            9196  THE        3.24 %
   TOTAL  2386                       TOTAL  283736


   P(Wordi = 'money' | spam)     1/2386
   -------------------------- = ----------- ≈ 1.918 > 1
   P(Wordi = 'money' | ham)      62/283736

   P(Wordi = 'is' | spam)     10/2386
   ------------------------ = ------------- ≈ 0.3622 < 1
   P(Wordi = 'is' | ham)       3283/283736
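These ratios follow directly from the counts in the table; a small sketch:

```python
# Word-probability ratios estimated from the corpus counts on the slide.
spam_counts = {'money': 1, 'viagra': 5, 'is': 10}
spam_total = 2386
ham_counts = {'algorithm': 21, 'money': 62, 'is': 3283}
ham_total = 283736

def ratio(w):
    # P(Word_i = w | spam) / P(Word_i = w | ham), both estimated as
    # relative frequencies in the respective corpus.
    return (spam_counts[w] / spam_total) / (ham_counts[w] / ham_total)

print(round(ratio('money'), 3))  # → 1.918  (evidence for spam)
print(round(ratio('is'), 4))     # → 0.3622 (evidence for ham)
```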


What if P('algorithm' | spam) = 0.0? This may lead to a situation where the ratio is 0/0! Solution: replace 0 by, e.g., 0.000001.
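The zero-replacement fix can be dropped straight into the estimation step; a minimal sketch, using the slide's suggested floor value (the helper name estimate_prob is illustrative):

```python
EPS = 0.000001  # floor value suggested on the slide

def estimate_prob(count, total):
    # Relative-frequency estimate, but a probability of exactly 0 is
    # replaced by a small constant so that ratios like
    # P(w | spam) / P(w | ham) never become 0/0.
    p = count / total
    return p if p > 0 else EPS

# 'algorithm' never occurs in the spam corpus:
print(estimate_prob(0, 2386))         # → 1e-06
print(estimate_prob(21, 283736) > 0)  # → True
```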
