2013 IEEE International Conference on Big Data
Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier
Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen
Outline
✤ Introduction
✤ Naive Bayes Classification
✤ Implementation of Naive Bayes in Hadoop
✤ Experimental Study
Introduction
A typical method to obtain valuable information is to extract the sentiment or opinion from a message
In this paper, we aim to evaluate the scalability of the Naive Bayes classifier (NBC) on large datasets
Introduction
NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput
The accuracy of NBC is improved, approaching 82%
Naive Bayes Classification
Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features
A popular method for text categorization (the problem of judging documents as belonging to one category or another)
Naive Bayes Classification
Prior probability: P(A)
Posterior probability: P(A|B)
Naive Bayes Classification
P(POS|excellent,terrible) = P(POS) × P(excellent,terrible|POS) / P(excellent,terrible)
P(POS|d1) = P(POS) × P(d1|POS) / P(d1)
(Bayes' theorem)
Naive Bayes Classification
P(POS|excellent,terrible) = P(POS) × P(excellent,terrible|POS) / P(excellent,terrible)
By the (naive) independence assumption: P(excellent,terrible|POS) ≈ P(excellent|POS) × P(terrible|POS)
P(POS|excellent,terrible) = P(POS) × P(excellent|POS) × P(terrible|POS) / P(excellent,terrible)
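Because the denominator P(excellent,terrible) is the same for every class, comparing the unnormalized products P(c) × Π P(w|c) is enough to pick the winning class. A minimal sketch of this, with hypothetical per-class word probabilities chosen only for illustration:

```python
# Unnormalized Naive Bayes score: prior times a product of per-word
# probabilities (the naive independence assumption). The shared
# denominator P(words) is dropped because it does not change the argmax.
def nb_score(prior, word_probs, words):
    score = prior
    for w in words:
        score *= word_probs[w]
    return score

# Hypothetical per-class word probabilities, for illustration only.
doc = ["excellent", "excellent", "terrible"]
pos = nb_score(0.5, {"excellent": 0.8, "terrible": 0.1}, doc)
neg = nb_score(0.5, {"excellent": 0.1, "terrible": 0.8}, doc)
print("POS" if pos > neg else "NEG")  # POS
```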
Naive Bayes Classification
classes    excellent   terrible
d1   POS       5           1
d2   NEG       2           6

d3 : (excellent,8),(terrible,2)

P(POS|excellent,terrible) = P(POS) × P(excellent|POS) × P(terrible|POS) / P(excellent,terrible)

Dropping the denominator, which is common to both classes:
P(POS|d3) ∝ (1/2) × (5/6)^8 × (1/6)^2
P(NEG|d3) ∝ (1/2) × (2/8)^8 × (6/8)^2
Naive Bayes Classification
d3 : (excellent,8),(terrible,2)

P(POS|d3) ∝ (1/2) × (5/6)^8 × (1/6)^2 ≈ 0.00323011165
P(NEG|d3) ∝ (1/2) × (2/8)^8 × (6/8)^2 ≈ 0.00000429153

⇒ d3 is POS
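The arithmetic on this slide can be reproduced directly from the table: priors of 1/2 each, word probabilities 5/6 and 1/6 for POS, 2/8 and 6/8 for NEG. A quick sketch:

```python
# d3 contains "excellent" 8 times and "terrible" twice; score each class
# with prior * P(excellent|c)^8 * P(terrible|c)^2 (denominator dropped).
p_pos = 0.5 * (5/6)**8 * (1/6)**2   # ~0.0032301, the slide's 0.00323011165
p_neg = 0.5 * (2/8)**8 * (6/8)**2   # ~0.0000042915, the slide's 0.00000429153

print("d3 is POS" if p_pos > p_neg else "d3 is NEG")  # d3 is POS
```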
Naive Bayes Classification
P(POS|d3) ∝ (1/2) × (5/6)^8 × (1/6)^2
Naive Bayes Classification
N is the total number of documents and Nc is the number of documents in class c, so P(c) = Nc/N
Nwi is the frequency of a word wi in class c, so P(wi|c) = Nwi divided by the total word count of class c
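These estimates can be sketched in a few lines of Python; the two-document corpus below mirrors the d1/d2 table from the earlier example:

```python
from collections import Counter

# Toy corpus mirroring the earlier slides: d1 is POS, d2 is NEG.
docs = [("POS", ["excellent"] * 5 + ["terrible"] * 1),
        ("NEG", ["excellent"] * 2 + ["terrible"] * 6)]

N = len(docs)                              # total number of documents
Nc = Counter(label for label, _ in docs)   # number of documents per class
word_counts = {}                           # Nwi: per-class word frequencies
for label, words in docs:
    word_counts.setdefault(label, Counter()).update(words)

prior = {c: Nc[c] / N for c in Nc}         # P(c) = Nc / N
likelihood = {c: {w: n / sum(wc.values())  # P(wi|c) = Nwi / words in class c
                  for w, n in wc.items()}
              for c, wc in word_counts.items()}

print(prior["POS"])                              # 0.5
print(round(likelihood["POS"]["excellent"], 4))  # 0.8333
```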
Implementation of Naive Bayes in Hadoop
Pre-processing the raw dataset
Implementation of Naive Bayes in Hadoop
1000 positive and 1000 negative reviews
Implementation of Naive Bayes in Hadoop
Training output records (word,posSum,negSum):
the frequency of each word across all positive and all negative documents
e.g. (excellent,1000,10)
Implementation of Naive Bayes in Hadoop
Training records (word,posSum,negSum), e.g. (excellent,1000,10), are joined on word with
test-side records (word,count,docID), e.g. (excellent,20,5), producing
(docID,count,word,posSum,negSum) records, e.g. (5,20,excellent,1000,10)
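The join step can be sketched in plain Python, with a dict standing in for the MapReduce shuffle on the `word` key. The (terrible,...) training record below is made up for illustration; only (excellent,1000,10) appears on the slide:

```python
# Training-side records: (word, posSum, negSum); test-side: (word, count, docID).
training = [("excellent", 1000, 10),
            ("terrible", 8, 950)]   # hypothetical record, for illustration only
test = [("excellent", 20, 5)]

# In the MapReduce job both sides are keyed by `word`; a dict stands in
# for the shuffle phase here.
model = {word: (pos_sum, neg_sum) for word, pos_sum, neg_sum in training}

joined = [(doc_id, count, word) + model[word]
          for word, count, doc_id in test if word in model]
print(joined[0])  # (5, 20, 'excellent', 1000, 10)
```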
Implementation of Naive Bayes in Hadoop
Input records, in the form (docID,count,word,posSum,negSum):
(5,10,excellent,20,5)
(5,2,terrible,5,20)
Per-class scores for document 5: pos = 10×log(20) + 2×log(5), neg = 10×log(5) + 2×log(20)
Output records (docID,predict,correct), e.g. (5,pos,true), (6,neg,false)
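A minimal sketch of this scoring step, with plain Python standing in for the Hadoop reducer; the score is count × log(classSum) summed per class, exactly as written on the slide:

```python
from math import log

# Joined records for document 5: (docID, count, word, posSum, negSum).
records = [(5, 10, "excellent", 20, 5),
           (5, 2, "terrible", 5, 20)]

# pos_score = 10*log(20) + 2*log(5); neg_score = 10*log(5) + 2*log(20)
pos_score = sum(count * log(pos_sum) for _, count, _, pos_sum, _ in records)
neg_score = sum(count * log(neg_sum) for _, count, _, _, neg_sum in records)

predict = "pos" if pos_score > neg_score else "neg"
print((5, predict))  # (5, 'pos')
```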
Experimental Study
7 nodes in total: one name node and six data nodes; each VM is allocated two virtual CPUs and 4GB of memory
Host: a Dell server with 12 Intel Xeon E5-2630 2.3GHz cores and 32GB of memory
Hypervisor: Xen Cloud Platform (XCP) 1.6
Experimental Study
Training data
Experimental Study