Summary of data mining on Yelp review data

0 introduction

1. data

1.1 json format raw data

1.2 convert review text from json to tab-separated values (tsv) file

1.3 clean review text and convert to binary input

2. analysis

2.1 algorithm on text mining

2.2 experiment

3. future work

Yelp Dataset Challenge

Yelp connects people to great local businesses. To help people find great local businesses, Yelp engineers have developed an excellent search engine to sift through over 89 million reviews and help people find the most relevant businesses for their everyday needs. Yelp is proud to introduce a deep dataset for research minded academics from our wealth of data. If you’ve been looking for a rich set of data to train your models on and use in publications, this is it. Tired of using the same standard datasets? Want some real world relevance in your research project? This data is for you!

• How well can you guess a review's rating from its text alone?

• Can you take all of the reviews of a business and predict when it will be the most busy, or when the business is open?

• Can you predict if a business is good for kids? Has WiFi? Has Parking?

• What makes a review useful, funny, or cool?

• Can you figure out which business a user is likely to review next?

• How much of a business's success is really just location, location, location?

• What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just “Chinese restaurants”), and can you learn this from the review text?

• What are the differences between the cities in the dataset?

[ref: https://www.yelp.com/html/pdf/Yelp_Dataset_Challenge_Terms_round_7.pdf]

This study uses the Yelp review dataset to address the first question: How well can you guess a review's rating from its text alone? The text mining experiment follows Pang et al. (2002).

[ref: Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.]

Goal:

1.1 json format raw data

The file yelp_academic_dataset_review.json contains 2,225,213 reviews, one JSON object per line:

$ wc -l yelp_academic_dataset_review.json

2225213 yelp_academic_dataset_review.json

Review format:

{

'type': 'review',

'business_id': (encrypted business id),

'user_id': (encrypted user id),

'stars': (star rating, rounded to half-stars),

'text': (review text),

'date': (date, formatted like '2012-03-14'),

'votes': {(vote type): (count)},

}
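As a minimal sketch (the pipeline described in this summary uses Hive, not Python), one line of the review file can be parsed as a JSON object and reduced to the two fields this study needs. The helper name `parse_review` and the shortened sample record are assumptions made for illustration.

```python
import json


def parse_review(line):
    """Parse one line of the review file and return (stars, text)."""
    record = json.loads(line)
    return record["stars"], record["text"]


# Illustrative record using the fields of the format above (text shortened).
sample = (
    '{"type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw", '
    '"user_id": "PUFPaY9KxDAcGqfsorJp3Q", "stars": 4, '
    '"text": "Mr Hoagie is an institution.", "date": "2012-08-01", '
    '"votes": {"funny": 0, "useful": 0, "cool": 0}}'
)

stars, text = parse_review(sample)
print(stars, text)
```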

Example:

{
"votes": {"funny": 0, "useful": 0, "cool": 0},
"user_id": "PUFPaY9KxDAcGqfsorJp3Q",
"review_id": "Ya85v4eqdd6k9Od8HbQjyA",
"stars": 4,
"date": "2012-08-01",
"text": "Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road.",
"type": "review",
"business_id": "5UmKMjUEUNdYWqANhGckJw"
}

1. data

1.2 convert review text from json to tab-separated values (tsv) file

• Process the JSON file with Hive. Import the yelp_academic_dataset_review.json file into Hadoop and output a null (\000) separated file as follows:

hive> CREATE TABLE IF NOT EXISTS tall(str string);

hive> LOAD DATA LOCAL INPATH '/home/gpyu/HotelBigdata/yelp/yelpDatasets/yelp_academic_dataset_review.json'
      OVERWRITE INTO TABLE tall;

hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/gpyu/hadoopOUT/'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\000'
      SELECT GET_JSON_OBJECT(tall.str, '$.stars'),
             GET_JSON_OBJECT(tall.str, '$.user_id'),
             GET_JSON_OBJECT(tall.str, '$.review_id'),
             GET_JSON_OBJECT(tall.str, '$.date'),
             GET_JSON_OBJECT(tall.str, '$.type'),
             GET_JSON_OBJECT(tall.str, '$.business_id'),
             GET_JSON_OBJECT(tall.str, '$.text')
      FROM tall;

• Convert the null-separated file to a tab-separated values (tsv) file with the program totsv.c.

totsv.c

*GOAL: preprocess the raw data from Hive into a standard tsv (tab-separated values) file.

*The raw data is a file of n null (0x00) separated columns. The first column must be exactly one character

*and every other column must be longer than one character.

*

*first step: replace every newline (0x0A), carriage return (0x0D) and tab (0x09) with a space (0x20).

*second step: restore the row delimiter.

change the character between two tuples to a newline (0x0A): [...0x0A xx 0x00...], where xx is the one-character first column, a decimal digit 1-5.

*third step: set the column delimiter. Replace every null (0x00) with a tab (0x09).
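The three steps above can be sketched in a few lines. This is not the source of totsv.c, which is not shown in this summary; it is a Python sketch that assumes the whole file fits in memory and that a row boundary is recognisable as a space followed by a single digit 1-5 and a null. The function name `null_to_tsv` and the sample bytes are illustrative.

```python
import re


def null_to_tsv(raw: bytes) -> bytes:
    # Step 1: replace newline, carriage return and tab with a space.
    flat = raw.translate(bytes.maketrans(b"\n\r\t", b"   "))
    # Step 2: a new row starts where a space is followed by the
    # one-character star column (1-5) and a null; restore the newline there.
    flat = re.sub(rb" (?=[1-5]\x00)", b"\n", flat)
    # Step 3: replace the null column delimiter with a tab.
    return flat.replace(b"\x00", b"\t")


# Two made-up rows in the Hive column order (stars, user_id, review_id,
# date, type, business_id, text); the first text contains a newline and a tab.
raw = (b"4\x00u1\x00r1\x002012-08-01\x00review\x00b1\x00Good\nfood\tthere\n"
       b"5\x00u2\x00r2\x002013-01-01\x00review\x00b2\x00Great\n")
rows = null_to_tsv(raw).decode().splitlines()
for row in rows:
    print(row.split("\t"))
```

The requirement that every column except the first is longer than one character is what keeps step 2 from splitting a row in the middle: a lone digit followed by a null can then only be the star column.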

null separated file

Tab-separated values (tsv) file: review_all_col7.tsv

1.3 clean review text and convert to binary input

To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let $\{f_1, \dots, f_m\}$ be a predefined set of $m$ features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let $n_i(d)$ be the number of times $f_i$ occurs in document $d$. Then, each document $d$ is represented by the document vector $\vec{d} := (n_1(d), n_2(d), \dots, n_m(d))$.

[ref: Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.]

• ToBinary.java converts the review text to binary input (frequency and presence).

/** ToBinary.java

*AUTHOR: GUANPING YU

*DATE: 2016-08-05

*VERSION: 0.5

* Features whose number of occurrences in the total set of documents is > 1% of the

number of observations can be selected.

*GOAL: clean the file, count the words, add the negation tag not_ and count

the words again.

* Select features by intersecting the word count with the dictionary.

* Convert the text to binary input using the frequency of the features.

*Tag the negation: add not_ to every word between a negation word

("not", "isn't", "didn't", etc.)

*/
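The negation tagging step can be sketched as follows. The header comment does not say where the negated span ends, so this sketch assumes it ends at the next punctuation mark, as in Pang et al. (2002); ToBinary.java may differ. The negation word list, the function name `tag_negation` and the tokenisation are likewise assumptions.

```python
import re

# Illustrative list; the full list used by ToBinary.java is not shown.
NEGATIONS = {"not", "no", "never", "isn't", "didn't", "don't", "wasn't"}
# Assumption: a negated span ends at the next punctuation mark.
CLAUSE_END = re.compile(r"[.,:;!?]")


def tag_negation(text):
    tagged = []
    negating = False
    # Tokens are words (apostrophes kept) or single punctuation marks.
    for token in re.findall(r"[A-Za-z']+|[.,:;!?]", text.lower()):
        if CLAUSE_END.fullmatch(token):
            negating = False
            tagged.append(token)
        elif token in NEGATIONS:
            negating = True
            tagged.append(token)
        else:
            tagged.append("not_" + token if negating else token)
    return " ".join(tagged)


result = tag_negation("I didn't like the food, but the service was good.")
print(result)
```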

Dictionary: dic_pos_neg

A list of positive and negative opinion words or sentiment words for English (6789 words).

http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
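Feature selection by intersecting the word count with the dictionary can be sketched as below. The tiny lexicon stands in for dic_pos_neg, the function name `select_features` is an assumption, and the threshold is treated as inclusive (for a 2k dataset, 1% of the observations corresponds to a minimum count of 20).

```python
from collections import Counter


def select_features(documents, lexicon, min_count):
    """Keep lexicon words whose total count over all documents is >= min_count."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return sorted(w for w, n in totals.items() if w in lexicon and n >= min_count)


# Toy data; the real input is the cleaned review text.
docs = ["good food", "bad food", "good good service", "awful"]
lexicon = {"good", "bad", "awful"}  # stand-in for dic_pos_neg

all_lexicon_words = select_features(docs, lexicon, min_count=1)
frequent_words = select_features(docs, lexicon, min_count=2)
print(all_lexicon_words, frequent_words)
```

Words such as "food" are dropped even when frequent because they are not in the lexicon, which is what keeps the feature set small.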

*clean file: replace the punctuation characters listed below with a space (the single quote ' is not replaced)

Dec Hex Binary Char Description

33 21 00100001 ! exclamation mark

34 22 00100010 " double quote

35 23 00100011 # number

36 24 00100100 $ dollar

37 25 00100101 % percent

38 26 00100110 & ampersand

40 28 00101000 ( left parenthesis

41 29 00101001 ) right parenthesis

42 2A 00101010 * asterisk

43 2B 00101011 + plus

44 2C 00101100 , comma

45 2D 00101101 - minus

46 2E 00101110 . period

47 2F 00101111 / slash

58 3A 00111010 : colon

59 3B 00111011 ; semicolon

60 3C 00111100 < less than

61 3D 00111101 = equality sign

62 3E 00111110 > greater than

63 3F 00111111 ? question mark

64 40 01000000 @ at sign

91 5B 01011011 [ left square bracket

92 5C 01011100 \ backslash

93 5D 01011101 ] right square bracket

94 5E 01011110 ^ caret / circumflex

95 5F 01011111 _ underscore

96 60 01100000 ` grave / accent*
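The cleaning rule in the table can be sketched with a translation table built from the listed code points (33-47 except 39, 58-64 and 91-96). This is a Python sketch of the rule, not the Java code of ToBinary.java; the names `PUNCTUATION`, `TO_SPACE` and `clean_text` are assumptions.

```python
# Code points taken from the table: 33-47 without 39 (single quote),
# 58-64 and 91-96.
PUNCTUATION = [
    chr(c)
    for c in list(range(33, 48)) + list(range(58, 65)) + list(range(91, 97))
    if c != 39
]
TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def clean_text(text):
    """Replace every listed punctuation character with a space."""
    return text.translate(TO_SPACE)


cleaned = clean_text("Overall, it's good (really)!")
print(cleaned.split())
```

Keeping the single quote preserves contractions such as "isn't" and "didn't", which the negation tagging relies on.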

Review text

Binary input (frequency)

2. analysis

2.1 algorithm on text mining

The classification algorithms were taken from the R caret package.

[ref: https://cran.r-project.org/web/packages/caret/]

• Naive Bayes (method = 'nb'). For classification using package klaR

• Learning Vector Quantization (method = 'lvq'). For classification using package class

• Neural Network (method = 'nnet'). For classification and regression using package nnet

• Neural Networks with Feature Extraction (method = 'pcaNNet'). For classification and regression using package nnet

• Support Vector Machines with Linear Kernel (method = 'svmLinear'). For classification and regression using package kernlab

• Support Vector Machines with Linear Kernel (method = 'svmLinear2'). For classification and regression using package e1071

• Linear Support Vector Machines with Class Weights (method = 'svmLinearWeights'). For classification using package e1071

• k-Nearest Neighbors (method = 'knn'). For classification and regression

• Support Vector Machines with Radial Basis Function Kernel (method = 'svmRadial'). For classification and regression using package kernlab

• Stochastic Gradient Boosting (method = 'gbm'). For classification and regression using packages gbm and plyr


2.2 experiment and results

• 2k dataset containing only star1 and star5 reviews, 1k of each. Features are filtered by feature frequency (the number of times f_i occurs in the whole dataset) at three thresholds: freq01 (>= 1), freq04 (>= 4) and freq20 (>= 20, i.e. 1% of the number of observations). The experiment compares the number of features with the prediction accuracy.

Model: svmLinear

File: review_2k_binfreq_svm.RData

[Figure: Accuracy and Kappa (confidence level 0.95) for freq04, freq20 and freq01; both axes range from 0.70 to 0.90]

Discussion:

freq01 gives the best result with 1767 features, while freq20 has only 150 features. The accuracies of these two sets are close (89.3% vs 87.7%). freq04 has 641 features and an accuracy of 86.4%. To save computation time, the remaining experiments keep only features f_i whose count in the whole dataset is >= 1% of the number of observations.
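The figures in this section report both Accuracy and Kappa. As a reminder of how the two relate, the sketch below computes accuracy and Cohen's kappa from a confusion matrix. The counts are hypothetical and are not results of this study; the function name `accuracy_and_kappa` is an assumption.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of items of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement is the accuracy.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for a balanced two-class test set (100 per class).
acc, kappa = accuracy_and_kappa([[90, 10], [14, 86]])
print(round(acc, 2), round(kappa, 2))
```

When the two true classes are balanced, the chance agreement is 0.5, so kappa equals 2 × accuracy − 1; this is why the Kappa panels sit lower than the Accuracy panels for the same models.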

• 2k dataset containing only star1 and star5 reviews, 1k of each. Frequency of features vs presence of features (freq vs pres)

Model: svmLinear, lvq and gbm

File: review_2k_star2_bin_svm_gbm_lvq.RData


[Figure: Accuracy and Kappa for lvqFreq, lvqPres, svmPres, gbmFreq, svmFreq and gbmPres; both axes range from 0.65 to 0.85]

Discussion: the frequency and presence of features give similar accuracy for the gbm, svmLinear and lvq models. The gbm and svmLinear models reach a higher accuracy (88%) than the lvq model (84%).

• 2k dataset containing only star1 and star5 reviews, 1k of each. Further classification models.

Model: nnet, pcaNNet and nb

File: review_2k_star2_bin_nb_nnt.RData


[Figure: Accuracy and Kappa for nbFreq, nbPres, PcaNNetFreq, PcaNNetPres, nnetFreq and nnetPres; both axes range from 0.0 to 0.8]

Discussion: for the naïve Bayes (nb) model, both the frequency and the presence of features gave lower accuracy (freq 51.7%, pres 68.4%). The neural network (nnet) and the neural network with feature extraction (pcaNNet) gave higher accuracy (87.3% to 88.1%), similar to the svm and gbm models (88%).

• 2k datasets (four different 2k samples, a-d) containing only star1 and star5 reviews, 1k of each. To evaluate model stability, we randomly selected four 2k samples from the Yelp reviews.

File: review_2k_binfreq_abcd_svm.RData

Model: svmLinear


[Figure: Accuracy and Kappa for a_2k, c_2k, b_2k and d_2k; both axes range from 0.75 to 0.90]

Discussion: compared with the previous 2k dataset, the prediction accuracy of the same model stays essentially the same regardless of which sample is selected.
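Drawing several balanced samples can be sketched as below. How the four samples were actually drawn is not described here, so the seeding scheme, the function name `balanced_sample` and the toy data are assumptions; in the study each sample holds 1,000 reviews per class.

```python
import random


def balanced_sample(rows, stars_wanted, per_class, seed):
    """rows: list of (stars, text). Draw per_class reviews for each wanted rating."""
    rng = random.Random(seed)
    sample = []
    for star in stars_wanted:
        pool = [row for row in rows if row[0] == star]
        sample.extend(rng.sample(pool, per_class))
    return sample


# Toy data standing in for the review table: 20 reviews per star rating.
rows = [(1 + i % 5, "review %d" % i) for i in range(100)]

# Four samples a-d, each with 10 star1 and 10 star5 reviews (1000 each in the study).
samples = {
    name: balanced_sample(rows, (1, 5), 10, seed)
    for seed, name in enumerate("abcd")
}
print({name: len(sample) for name, sample in samples.items()})
```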

• 5k dataset containing star1 to star5 reviews, 1k of each, to evaluate the models on 5 levels (1-5).

File: review_5k_bin_svm_gbm_lvq.RData

Model: svmLinear, gbm and lvq


[Figure: Accuracy and Kappa for lvqFreq, lvqPres, gbmPres, gbmFreq, svmFreq and svmPres; both axes range from 0.20 to 0.45]

Discussion: when the number of levels increases from 2 to 5, the prediction accuracy decreases from 88% to 44%.

• 5k dataset containing star1 to star5 reviews, 1k of each. star2 is assigned to star1 and star4 to star5, to evaluate the models on 3 levels (1, 3, 5).

File: review_5k_star3_bin_svm_gbm_lvq.RData



[Figure: Accuracy and Kappa for lvqFreqL3, lvqPresL3, svmPresL3, svmFreqL3, gbmFreqL3 and gbmPresL3; both axes range from 0.4 to 0.7]

Discussion: when the number of levels decreases from 5 to 3, the prediction accuracy increases; for the svmLinear model it rises from 44% to 66%.

• 5k dataset containing star1 to star5 reviews, 1k of each. star2 and star3 are assigned to star1 and star4 to star5, to evaluate the models on 2 levels (1, 5).

File: review_5k_star2a_bin_svm_gbm_lvq.RData



[Figure: Accuracy and Kappa for lvqPres2a, lvqFreq2a, gbmFreq2a, gbmPres2a, svmFreq2a and svmPres2a; both axes range from 0.4 to 0.8]

Discussion: as the number of levels decreases, the accuracy increases. From 3 levels to 2 levels, the accuracy rises from 66% to 79%.

• 5k dataset containing star1 to star5 reviews, 1k of each. star2 is assigned to star1, and star3 and star4 to star5, to evaluate whether star3 is better assigned to star1 or to star5.

File: review_5k_star2b_bin_svm_gbm_lvq.RData



[Figure: Accuracy and Kappa for lvqFreq2b, lvqPres2b, gbmPres2b, gbmFreq2b, svmFreq2b and svmPres2b; both axes range from 0.4 to 0.8]

Discussion: for the svmLinear model there is no notable difference between assigning star3 to star1 (2a, 79%) and assigning it to star5 (2b, 78%).
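The three relabelling schemes used on the 5k dataset can be written as lookup tables. The scheme names follow the file suffixes above (L3, 2a, 2b); the dictionary `LEVEL_MAPS` and the function `relabel` are illustrative names, not taken from the original scripts.

```python
# Star rating -> class label for each scheme described above.
LEVEL_MAPS = {
    "L3": {1: 1, 2: 1, 3: 3, 4: 5, 5: 5},  # star2 -> star1, star4 -> star5
    "2a": {1: 1, 2: 1, 3: 1, 4: 5, 5: 5},  # star3 joins star1
    "2b": {1: 1, 2: 1, 3: 5, 4: 5, 5: 5},  # star3 joins star5
}


def relabel(stars, scheme):
    """Map a list of star ratings to the class labels of the given scheme."""
    mapping = LEVEL_MAPS[scheme]
    return [mapping[s] for s in stars]


ratings = [1, 2, 3, 4, 5]
for scheme in ("L3", "2a", "2b"):
    print(scheme, relabel(ratings, scheme))
```

With 1k reviews per star rating, the merged classes are no longer balanced: scheme 2a gives 3k reviews in class 1 and 2k in class 5, and scheme 2b the reverse.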

3. Future work

• Read around 200 to 500 reviews in detail to gain further insight into patterns, differences and feature selection.

• Read recent literature on new text mining models.

• Investigate how to use the text mining results in a recommendation system.
